Abstract
This is our study on crime. Crime does not pay. Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.In this report, we seek to examine and discuss determinants of crime and offer recommend actionable policy recommendations for local politicians running for election at the county level. For our analysis, we draw on sample data collected from a study by Cornwell and Trumball, researchers from the University of Georgia and West Virginia University. Our sample data includes data on crime rates, arrests, sentences, demographics, local weekly wages, tax revenues and more drawn from local and federal government data sources. Although the age of the data may be a potential limitation of our study, we believe the insights we gather and policy recommendations remain appropriate for local campaigns today.
Our primary question that will drive our data exploration are to ask which variables affect crime rate the most.
The crime_v2 dataset provided includes 25 variables of interest.
We include them below for reference by category of interest.
| Category | Variable |
|---|---|
| Crime Rate | crmrte |
| Geographic | county, west, central |
| Demographic | urban, density, pctmin80, pctymle |
| Economic - Wage | wcon, wtuc, wtrd, wfir, wser, wmfg, wfed, wsta, wloc |
| Economic - Revenue | taxpc |
| Law Enforcment | polpc, prbarr, prbconv, mix |
| Judicial/Sentencing | prbpris, avgsen |
| Time Period | year |
The variables above operationalize the conditions we wish to explore and their affects on crime rate
Chiefly, these break down as follows.
The Economic variables measures the county’s economic activity and health (e.g. opportunity to pursue legal forms of income). These variables come in the form of available wages and tax revenue returned to the county.
The Law enforcment variables measures the county’s ability to utilize law enforcment policy to deter crime. Similarly, the Judicial variables also signify impact of deterence to crime.
The Demographic variables measure the cultural variability that represent the social differences between each county, such as urban vs rural and minority populations.
The Geographic elements are categorical. They represent they ways in which the population is segmented by geography.
We begin our analysis by loading the data set and performing basic checks and inspections.
dfCrime = read.csv("crime_v2.csv")
str(dfCrime)
'data.frame': 97 obs. of 25 variables:
$ county : int 1 3 5 7 9 11 13 15 17 19 ...
$ year : int 87 87 87 87 87 87 87 87 87 87 ...
$ crmrte : num 0.0356 0.0153 0.013 0.0268 0.0106 ...
$ prbarr : num 0.298 0.132 0.444 0.365 0.518 ...
$ prbconv : Factor w/ 92 levels "","`","0.068376102",..: 63 89 13 62 52 3 59 78 42 86 ...
$ prbpris : num 0.436 0.45 0.6 0.435 0.443 ...
$ avgsen : num 6.71 6.35 6.76 7.14 8.22 ...
$ polpc : num 0.001828 0.000746 0.001234 0.00153 0.00086 ...
$ density : num 2.423 1.046 0.413 0.492 0.547 ...
$ taxpc : num 31 26.9 34.8 42.9 28.1 ...
$ west : int 0 0 1 0 1 1 0 0 0 0 ...
$ central : int 1 1 0 1 0 0 0 0 0 0 ...
$ urban : int 0 0 0 0 0 0 0 0 0 0 ...
$ pctmin80: num 20.22 7.92 3.16 47.92 1.8 ...
$ wcon : num 281 255 227 375 292 ...
$ wtuc : num 409 376 372 398 377 ...
$ wtrd : num 221 196 229 191 207 ...
$ wfir : num 453 259 306 281 289 ...
$ wser : num 274 192 210 257 215 ...
$ wmfg : num 335 300 238 282 291 ...
$ wfed : num 478 410 359 412 377 ...
$ wsta : num 292 363 332 328 367 ...
$ wloc : num 312 301 281 299 343 ...
$ mix : num 0.0802 0.0302 0.4651 0.2736 0.0601 ...
$ pctymle : num 0.0779 0.0826 0.0721 0.0735 0.0707 ...
head(dfCrime)
county year crmrte prbarr prbconv prbpris avgsen polpc
1 1 87 0.0356036 0.298270 0.527595997 0.436170 6.71 0.00182786
2 3 87 0.0152532 0.132029 1.481480002 0.450000 6.35 0.00074588
3 5 87 0.0129603 0.444444 0.267856985 0.600000 6.76 0.00123431
4 7 87 0.0267532 0.364760 0.525424004 0.435484 7.14 0.00152994
5 9 87 0.0106232 0.518219 0.476563007 0.442623 8.22 0.00086018
6 11 87 0.0146067 0.524664 0.068376102 0.500000 13.00 0.00288203
density taxpc west central urban pctmin80 wcon wtuc
1 2.4226327 30.99368 0 1 0 20.21870 281.4259 408.7245
2 1.0463320 26.89208 0 1 0 7.91632 255.1020 376.2542
3 0.4127659 34.81605 1 0 0 3.16053 226.9470 372.2084
4 0.4915572 42.94759 0 1 0 47.91610 375.2345 397.6901
5 0.5469484 28.05474 1 0 0 1.79619 292.3077 377.3126
6 0.6113361 35.22974 1 0 0 1.54070 250.4006 401.3378
wtrd wfir wser wmfg wfed wsta wloc mix
1 221.2701 453.1722 274.1775 334.54 477.58 292.09 311.91 0.08016878
2 196.0101 258.5650 192.3077 300.38 409.83 362.96 301.47 0.03022670
3 229.3209 305.9441 209.6972 237.65 358.98 331.53 281.37 0.46511629
4 191.1720 281.0651 256.7214 281.80 412.15 328.27 299.03 0.27362204
5 206.8215 289.3125 215.1933 290.89 377.35 367.23 342.82 0.06008584
6 187.8255 258.5650 237.1507 258.60 391.48 325.71 275.22 0.31952664
pctymle
1 0.07787097
2 0.08260694
3 0.07211538
4 0.07353726
5 0.07069755
6 0.09891920
tail(dfCrime)
county year crmrte prbarr prbconv prbpris avgsen polpc density taxpc
92 NA NA NA NA NA NA NA NA NA
93 NA NA NA NA NA NA NA NA NA
94 NA NA NA NA NA NA NA NA NA
95 NA NA NA NA NA NA NA NA NA
96 NA NA NA NA NA NA NA NA NA
97 NA NA NA NA ` NA NA NA NA NA
west central urban pctmin80 wcon wtuc wtrd wfir wser wmfg wfed wsta
92 NA NA NA NA NA NA NA NA NA NA NA NA
93 NA NA NA NA NA NA NA NA NA NA NA NA
94 NA NA NA NA NA NA NA NA NA NA NA NA
95 NA NA NA NA NA NA NA NA NA NA NA NA
96 NA NA NA NA NA NA NA NA NA NA NA NA
97 NA NA NA NA NA NA NA NA NA NA NA NA
wloc mix pctymle
92 NA NA NA
93 NA NA NA
94 NA NA NA
95 NA NA NA
96 NA NA NA
97 NA NA NA
summary(dfCrime)
county year crmrte prbarr
Min. : 1.0 Min. :87 Min. :0.005533 Min. :0.09277
1st Qu.: 52.0 1st Qu.:87 1st Qu.:0.020927 1st Qu.:0.20568
Median :105.0 Median :87 Median :0.029986 Median :0.27095
Mean :101.6 Mean :87 Mean :0.033400 Mean :0.29492
3rd Qu.:152.0 3rd Qu.:87 3rd Qu.:0.039642 3rd Qu.:0.34438
Max. :197.0 Max. :87 Max. :0.098966 Max. :1.09091
NA's :6 NA's :6 NA's :6 NA's :6
prbconv prbpris avgsen polpc
: 5 Min. :0.1500 Min. : 5.380 Min. :0.000746
0.588859022: 2 1st Qu.:0.3648 1st Qu.: 7.340 1st Qu.:0.001231
` : 1 Median :0.4234 Median : 9.100 Median :0.001485
0.068376102: 1 Mean :0.4108 Mean : 9.647 Mean :0.001702
0.140350997: 1 3rd Qu.:0.4568 3rd Qu.:11.420 3rd Qu.:0.001877
0.154451996: 1 Max. :0.6000 Max. :20.700 Max. :0.009054
(Other) :86 NA's :6 NA's :6 NA's :6
density taxpc west central
Min. :0.00002 Min. : 25.69 Min. :0.0000 Min. :0.0000
1st Qu.:0.54741 1st Qu.: 30.66 1st Qu.:0.0000 1st Qu.:0.0000
Median :0.96226 Median : 34.87 Median :0.0000 Median :0.0000
Mean :1.42884 Mean : 38.06 Mean :0.2527 Mean :0.3736
3rd Qu.:1.56824 3rd Qu.: 40.95 3rd Qu.:0.5000 3rd Qu.:1.0000
Max. :8.82765 Max. :119.76 Max. :1.0000 Max. :1.0000
NA's :6 NA's :6 NA's :6 NA's :6
urban pctmin80 wcon wtuc
Min. :0.00000 Min. : 1.284 Min. :193.6 Min. :187.6
1st Qu.:0.00000 1st Qu.: 9.845 1st Qu.:250.8 1st Qu.:374.6
Median :0.00000 Median :24.312 Median :281.4 Median :406.5
Mean :0.08791 Mean :25.495 Mean :285.4 Mean :411.7
3rd Qu.:0.00000 3rd Qu.:38.142 3rd Qu.:314.8 3rd Qu.:443.4
Max. :1.00000 Max. :64.348 Max. :436.8 Max. :613.2
NA's :6 NA's :6 NA's :6 NA's :6
wtrd wfir wser wmfg
Min. :154.2 Min. :170.9 Min. : 133.0 Min. :157.4
1st Qu.:190.9 1st Qu.:286.5 1st Qu.: 229.7 1st Qu.:288.9
Median :203.0 Median :317.3 Median : 253.2 Median :320.2
Mean :211.6 Mean :322.1 Mean : 275.6 Mean :335.6
3rd Qu.:225.1 3rd Qu.:345.4 3rd Qu.: 280.5 3rd Qu.:359.6
Max. :354.7 Max. :509.5 Max. :2177.1 Max. :646.9
NA's :6 NA's :6 NA's :6 NA's :6
wfed wsta wloc mix
Min. :326.1 Min. :258.3 Min. :239.2 Min. :0.01961
1st Qu.:400.2 1st Qu.:329.3 1st Qu.:297.3 1st Qu.:0.08074
Median :449.8 Median :357.7 Median :308.1 Median :0.10186
Mean :442.9 Mean :357.5 Mean :312.7 Mean :0.12884
3rd Qu.:478.0 3rd Qu.:382.6 3rd Qu.:329.2 3rd Qu.:0.15175
Max. :598.0 Max. :499.6 Max. :388.1 Max. :0.46512
NA's :6 NA's :6 NA's :6 NA's :6
pctymle
Min. :0.06216
1st Qu.:0.07443
Median :0.07771
Mean :0.08396
3rd Qu.:0.08350
Max. :0.24871
NA's :6
First, we note there are missing rows in the dataset that were imported. We’ll remove those rows now.
nrow(dfCrime)
[1] 97
dfCrime <-na.omit(dfCrime) # omit the NA rows
nrow(dfCrime)
[1] 91
Next, we will inspect the data to see if there are duplicate records
dfCrime[duplicated(dfCrime),]
county year crmrte prbarr prbconv prbpris avgsen polpc
89 193 87 0.0235277 0.266055 0.588859022 0.423423 5.86 0.00117887
density taxpc west central urban pctmin80 wcon wtuc
89 0.8138298 28.51783 1 0 0 5.93109 285.8289 480.1948
wtrd wfir wser wmfg wfed wsta wloc mix
89 268.3836 365.0196 295.9352 295.63 468.26 337.88 348.74 0.1105016
pctymle
89 0.07819394
A duplicate row exists. We’ll remove it.
dfCrime <- dfCrime[!duplicated(dfCrime),] # remove the duplicated row
nrow(dfCrime)
[1] 90
We also saw that pbconv was coded as a level. It is not a level but a ratio. We’ll change that now.
dfCrime$prbconv<-as.numeric(levels(dfCrime$prbconv))[dfCrime$prbconv]
We also notice by comparision of pctymle and pctmin80 one of the variables is off by a factor of 100. We will divide pctmin80 by 100 so the two variables are in the same unit terms.
dfCrime$pctmin80<-dfCrime$pctmin80/100
County was expressed as a number. However, it is a categorical variable and we will convert it to a factor instead.
dfCrime$county<-as.factor(dfCrime$county)
Next we inspect the indicator variables to see if they were coded correctly.
dfCrime %>% group_by(west, central) %>% tally()
# A tibble: 4 x 3
# Groups: west [2]
west central n
<int> <int> <int>
1 0 0 35
2 0 1 33
3 1 0 21
4 1 1 1
dfCrime %>%
filter(west ==1 & central ==1)
county year crmrte prbarr prbconv prbpris avgsen polpc
1 71 87 0.0544061 0.243119 0.22959 0.379175 11.29 0.00207028
density taxpc west central urban pctmin80 wcon wtuc wtrd
1 4.834734 31.53658 1 1 0 0.13315 291.4508 595.3719 240.3673
wfir wser wmfg wfed wsta wloc mix pctymle
1 348.0254 295.2301 358.95 509.43 359.11 339.58 0.1018608 0.07939028
One county was either mis-coded, or it truly belongs to both regions. However, this is very unlikely as the intended technique is to widen the data and introduce indicator variables for each category. It is not likley the data was captured for both categories.
We will need further analysis on this datapoint as it relates to the rest of the data.
For now, we will encode a new region variable and place the datapoint in its own category.
#Map central and west to a region code, and create a new category for other
# Note that county 71 has both western and central codes
dfCrime$region <- case_when (
(dfCrime$central ==0 & dfCrime$west ==0) ~ 0, #Eastern, Coastal, Other
(dfCrime$central ==0 & dfCrime$west ==1) ~ 1, #Western
(dfCrime$central ==1 & dfCrime$west ==0) ~ 2, #Central
(dfCrime$central ==1 & dfCrime$west ==1) ~ 3 #Central-Western border county - this could be Charlotte or it could be a coding error.
)
dfCrime$regcode =
factor( dfCrime$region , levels = 0:3 , labels =
c( 'O',
'W',
'C',
'CW')
)
We will also introduce an indicator variable for counties located in the “other” region that are not west or central
dfCrime$other <- ifelse((dfCrime$central ==0 & dfCrime$west ==0), 1, 0)
And we’ll add an indicator variable to serve as complement to the urban indicator variable and call this ‘nonurban’
dfCrime$nonurban <- ifelse((dfCrime$urban==0), 1, 0)
By way of the 1980 Census fact sheet, we discover the urban field is an encoding for SMSA (Standard Metropolitan Statistical Areas). https://www2.census.gov/prod2/decennial/documents/1980/1980censusofpopu8011uns_bw.pdf The value is one if the county is inside a metropolitan area. Otherwise, if the county is outisde a metropolitan area, the value is zero.
We create a metro factor variable to better describe this feature.
# create factor for SMSA (standard metropolitan statistical areas) with two levels - inside or outside
# https://www2.census.gov/prod2/decennial/documents/1980/1980censusofpopu8011uns_bw.pdf
dfCrime$metro =
factor( dfCrime$urban , levels = 0:1 , labels =
c( 'Outside',
'Inside'
)
)
Next we will visualize each variable and its relationship to the variable crmrte through scatter plots
#Plot of the economic and tax related variables vs crmrte
q1<-ggplot(data = dfCrime, aes(x = wcon, y = crmrte, color = region)) +
geom_point()+
geom_smooth(method = "lm")
q2<-ggplot(data = dfCrime, aes(x = wtuc, y = crmrte, color = region)) +
geom_point()+
geom_smooth(method = "lm")
q3<-ggplot(data = dfCrime, aes(x = wtrd, y = crmrte, color = region)) +
geom_point()+
geom_smooth(method = "lm")
q4<-ggplot(data = dfCrime, aes(x = wfir, y = crmrte, color = region)) +
geom_point()+
geom_smooth(method = "lm")
q5<-ggplot(data = dfCrime, aes(x = wser, y = crmrte, color = region)) +
geom_point()+
geom_smooth(method = "lm")
q6<-ggplot(data = dfCrime, aes(x = wmfg, y = crmrte, color = region)) +
geom_point()+
geom_smooth(method = "lm")
q7<-ggplot(data = dfCrime, aes(x = wfed, y = crmrte, color = region)) +
geom_point()+
geom_smooth(method = "lm")
q8<-ggplot(data = dfCrime, aes(x = wsta, y = crmrte, color = region)) +
geom_point()+
geom_smooth(method = "lm")
q9<-ggplot(data = dfCrime, aes(x = wloc, y = crmrte, color = region)) +
geom_point()+
geom_smooth(method = "lm")
q10<-ggplot(data = dfCrime, aes(x = taxpc, y = crmrte, color = region)) +
geom_point()+
geom_smooth(method = "lm")
grid.arrange(q1, q2, q3, q4, q5, q6, q7, q8, q9, q10, ncol=2)
We observe a few data points of interest in the comparison above, notably, wser and taxpc appear to have extreme data points
Other variables show outliers as well, but not as extreme. We will see if any of these points have leverage or influence if chosen for models.
For now, lets dig further into the extreme outliers from our immediate visual inspection.
dfCrime %>%
filter(wser > 2000) %>%
select(county, wser)
county wser
1 185 2177.068
This average service wage is much too high based on what we know about the 1980s and every other wage recorded in comparison. A review of the detailed population statistics describing mean wage per industry (table 231) confirms this. https://www2.census.gov/prod2/decennial/documents/1980/1980censusofpopu801352uns_bw.pdf
We will adjust this wage by replacing it with an imputed value from the sample population. To impute this value we will rely on the package Hmisc to derive it for us.
dfCrime$wser[which(dfCrime$county==185)]<-NA # set the value to NA so it will be imputed
impute_arg <- aregImpute(~ crmrte + urban + central + west + other +
prbarr + prbconv + prbpris + avgsen + polpc +
density + taxpc + pctmin80 + wcon + wtuc +
wtrd + wfir + wser + wmfg + wfed + wsta + wloc +
mix + pctymle, data = dfCrime, n.impute = 30)
Iteration 1
Iteration 2
Iteration 3
Iteration 4
Iteration 5
Iteration 6
Iteration 7
Iteration 8
Iteration 9
Iteration 10
Iteration 11
Iteration 12
Iteration 13
Iteration 14
Iteration 15
Iteration 16
Iteration 17
Iteration 18
Iteration 19
Iteration 20
Iteration 21
Iteration 22
Iteration 23
Iteration 24
Iteration 25
Iteration 26
Iteration 27
Iteration 28
Iteration 29
Iteration 30
impute_arg
Multiple Imputation using Bootstrap and PMM
aregImpute(formula = ~crmrte + urban + central + west + other +
prbarr + prbconv + prbpris + avgsen + polpc + density + taxpc +
pctmin80 + wcon + wtuc + wtrd + wfir + wser + wmfg + wfed +
wsta + wloc + mix + pctymle, data = dfCrime, n.impute = 30)
n: 90 p: 24 Imputations: 30 nk: 3
Number of NAs:
crmrte urban central west other prbarr prbconv prbpris
0 0 0 0 0 0 0 0
avgsen polpc density taxpc pctmin80 wcon wtuc wtrd
0 0 0 0 0 0 0 0
wfir wser wmfg wfed wsta wloc mix pctymle
0 1 0 0 0 0 0 0
type d.f.
crmrte s 2
urban l 1
central l 1
west l 1
other l 1
prbarr s 2
prbconv s 2
prbpris s 2
avgsen s 2
polpc s 2
density s 2
taxpc s 2
pctmin80 s 2
wcon s 2
wtuc s 2
wtrd s 2
wfir s 2
wser s 1
wmfg s 2
wfed s 2
wsta s 2
wloc s 2
mix s 2
pctymle s 2
Transformation of Target Variables Forced to be Linear
R-squares for Predicting Non-Missing Values for Each Variable
Using Last Imputations of Predictors
wser
0.959
df<-melt(t(impute_arg$imputed$wser), value.name=c("wser"))
options(repr.plot.width=4, repr.plot.height=2)
ggplot(data = df, aes(x = wser)) + geom_histogram(bins=30)
options(repr.plot.width=NULL, repr.plot.height=NULL)
We will reassign the value in our dataset to the median from these trials.
dfCrime$wser[which(dfCrime$county==185)]<-median(impute_arg$imputed$wser)
dfCrime$wser[which(dfCrime$county==185)]
[1] 266.7583
Next, let’s examine the tax per capita outlier
dfCrime %>%
filter(taxpc > 100)
county year crmrte prbarr prbconv prbpris avgsen polpc
1 55 87 0.0790163 0.224628 0.207831 0.304348 13.57 0.00400962
density taxpc west central urban pctmin80 wcon wtuc
1 0.5115089 119.7615 0 0 0 0.0649622 309.5238 445.2762
wtrd wfir wser wmfg wfed wsta wloc mix
1 189.7436 284.5933 221.3903 319.21 338.91 361.68 326.08 0.08437271
pctymle region regcode other nonurban metro
1 0.07613807 0 O 1 1 Outside
The tax revenue per capita in this county is excessive. There is nothing in the wage variables that would indicate more tax revenues should be captured than what is normal. We will adjust this taxpc data point by replacing it by imputing its value from the sample.
dfCrime$taxpc[which(dfCrime$county==55)]<- NA
impute_arg <- aregImpute(~ crmrte + urban + central + west + other +
prbarr + prbconv + prbpris + avgsen + polpc +
density + taxpc + pctmin80 + wcon + wtuc +
wtrd + wfir + wser + wmfg + wfed + wsta + wloc +
mix + pctymle, data = dfCrime, n.impute = 30)
Iteration 1
Iteration 2
Iteration 3
Iteration 4
Iteration 5
Iteration 6
Iteration 7
Iteration 8
Iteration 9
Iteration 10
Iteration 11
Iteration 12
Iteration 13
Iteration 14
Iteration 15
Iteration 16
Iteration 17
Iteration 18
Iteration 19
Iteration 20
Iteration 21
Iteration 22
Iteration 23
Iteration 24
Iteration 25
Iteration 26
Iteration 27
Iteration 28
Iteration 29
Iteration 30
impute_arg
Multiple Imputation using Bootstrap and PMM
aregImpute(formula = ~crmrte + urban + central + west + other +
prbarr + prbconv + prbpris + avgsen + polpc + density + taxpc +
pctmin80 + wcon + wtuc + wtrd + wfir + wser + wmfg + wfed +
wsta + wloc + mix + pctymle, data = dfCrime, n.impute = 30)
n: 90 p: 24 Imputations: 30 nk: 3
Number of NAs:
crmrte urban central west other prbarr prbconv prbpris
0 0 0 0 0 0 0 0
avgsen polpc density taxpc pctmin80 wcon wtuc wtrd
0 0 0 1 0 0 0 0
wfir wser wmfg wfed wsta wloc mix pctymle
0 0 0 0 0 0 0 0
type d.f.
crmrte s 2
urban l 1
central l 1
west l 1
other l 1
prbarr s 2
prbconv s 2
prbpris s 2
avgsen s 2
polpc s 2
density s 2
taxpc s 1
pctmin80 s 2
wcon s 2
wtuc s 2
wtrd s 2
wfir s 2
wser s 2
wmfg s 2
wfed s 2
wsta s 2
wloc s 2
mix s 2
pctymle s 2
Transformation of Target Variables Forced to be Linear
R-squares for Predicting Non-Missing Values for Each Variable
Using Last Imputations of Predictors
taxpc
0.795
df<-melt(t(impute_arg$imputed$taxpc), value.name=c("taxpc"))
options(repr.plot.width=4, repr.plot.height=2)
ggplot(data = df, aes(x = taxpc)) + geom_histogram(bins=30)
options(repr.plot.width=NULL, repr.plot.height=NULL)
dfCrime$taxpc[which(dfCrime$county==55)]<-median(impute_arg$imputed$taxpc)
dfCrime$taxpc[which(dfCrime$county==55)]
[1] 34.48254
#Plot of the criminal justice and law enforcment related variables vs crmrte
q1<-ggplot(data = dfCrime, aes(x = prbarr, y = crmrte, color = region)) +
geom_point()+
geom_smooth(method = "lm")
q2<-ggplot(data = dfCrime, aes(x = prbconv, y = crmrte, color = region)) +
geom_point()+
geom_smooth(method = "lm")
q3<-ggplot(data = dfCrime, aes(x = prbpris, y = crmrte, color = region)) +
geom_point()+
geom_smooth(method = "lm")
q4<-ggplot(data = dfCrime, aes(x = avgsen, y = crmrte, color = region)) +
geom_point()+
geom_smooth(method = "lm")
q5<-ggplot(data = dfCrime, aes(x = polpc, y = crmrte, color = region)) +
geom_point()+
geom_smooth(method = "lm")
q6<-ggplot(data = dfCrime, aes(x = mix, y = crmrte, color = region)) +
geom_point()+
geom_smooth(method = "lm")
grid.arrange(q1, q2, q3, q4, q5, q6, ncol=2)
The criminal justice and law enforcement variables also show evidence of possible outliers, notably, pbarr and polpc appear to have extreme data points
We also see that prbarr and prbconv have values greater than 1. However, these are not true probabity numbers and are instead ratios used as a stand in for the true probability numbers.
There is a possibility of higher arrests per incident for an area. Meaning, the area has low incidents in general but when there were incidents there were also multiple arrests. The same case can be made for the convictions per arrest variable which we see is for a different region. In that county there may have been multiple charges brought per one arrest.
#plot of demographic information for counties Outside and Inside the metro areas
# population density, percent minority, percent young male
q1<-ggplot(data = dfCrime, aes(x = density, y = crmrte, color = region)) +
geom_point() + facet_wrap(~ metro) +
geom_smooth(method = "lm")
q2<-ggplot(data = dfCrime, aes(x = pctmin80, y = crmrte, color = region)) +
geom_point() + facet_wrap(~ metro) +
geom_smooth(method = "lm")
q3<-ggplot(data = dfCrime, aes(x = pctymle, y = crmrte, color = region)) +
geom_point()+ facet_wrap(~ metro) +
geom_smooth(method = "lm")
grid.arrange(q1, q2, q3, ncol=1)
Notably more outliers are observed in demographic information. Here, pctymle in one county outside of a metro area is nearly 25%. That seems quite high in normal statistical measures of the population, however, this can be explained as a county having a large college town population.
Finally, we can see our bright blue region 3 county and notice its population density. Its behavior is more similar to an inside metro area. Than outside. In addition to be coded for both western and central regions, it appears to be miscoded here as well.
We will address the metro variable, and examine whether the region variable should be west, central or other instead of both central and west
dfCrime %>%
filter(west ==1 & central ==1) %>%
select(county, west, central, other, urban, region, regcode, metro)
county west central other urban region regcode metro
1 71 1 1 0 0 3 CW Outside
dfCrime$west[which(dfCrime$county==71)]<-NA
dfCrime$central[which(dfCrime$county==71)]<-NA
dfCrime$other[which(dfCrime$county==71)]<-NA
dfCrime$urban[which(dfCrime$county==71)]<-NA
impute_arg <- aregImpute(~ crmrte + urban + central + west +
prbarr + prbconv + prbpris + avgsen + polpc +
density + taxpc + pctmin80 + wcon + wtuc +
wtrd + wfir + wser + wmfg + wfed + wsta + wloc +
mix + pctymle, data = dfCrime, match="closest", n.impute = 100)
Iteration 1
Iteration 2
Iteration 3
Iteration 4
Iteration 5
Iteration 6
Iteration 7
Iteration 8
Iteration 9
Iteration 10
Iteration 11
Iteration 12
Iteration 13
Iteration 14
Iteration 15
Iteration 16
Iteration 17
Iteration 18
Iteration 19
Iteration 20
Iteration 21
Iteration 22
Iteration 23
Iteration 24
Iteration 25
Iteration 26
Iteration 27
Iteration 28
Iteration 29
Iteration 30
Iteration 31
Iteration 32
Iteration 33
Iteration 34
Iteration 35
Iteration 36
Iteration 37
Iteration 38
Iteration 39
Iteration 40
Iteration 41
Iteration 42
Iteration 43
Iteration 44
Iteration 45
Iteration 46
Iteration 47
Iteration 48
Iteration 49
Iteration 50
Iteration 51
Iteration 52
Iteration 53
Iteration 54
Iteration 55
Iteration 56
Iteration 57
Iteration 58
Iteration 59
Iteration 60
Iteration 61
Iteration 62
Iteration 63
Iteration 64
Iteration 65
Iteration 66
Iteration 67
Iteration 68
Iteration 69
Iteration 70
Iteration 71
Iteration 72
Iteration 73
Iteration 74
Iteration 75
Iteration 76
Iteration 77
Iteration 78
Iteration 79
Iteration 80
Iteration 81
Iteration 82
Iteration 83
Iteration 84
Iteration 85
Iteration 86
Iteration 87
Iteration 88
Iteration 89
Iteration 90
Iteration 91
Iteration 92
Iteration 93
Iteration 94
Iteration 95
Iteration 96
Iteration 97
Iteration 98
Iteration 99
Iteration 100
Iteration 101
Iteration 102
Iteration 103
impute_arg
Multiple Imputation using Bootstrap and PMM
aregImpute(formula = ~crmrte + urban + central + west + prbarr +
prbconv + prbpris + avgsen + polpc + density + taxpc + pctmin80 +
wcon + wtuc + wtrd + wfir + wser + wmfg + wfed + wsta + wloc +
mix + pctymle, data = dfCrime, n.impute = 100, match = "closest")
n: 90 p: 23 Imputations: 100 nk: 3
Number of NAs:
crmrte urban central west prbarr prbconv prbpris avgsen
0 1 1 1 0 0 0 0
polpc density taxpc pctmin80 wcon wtuc wtrd wfir
0 0 0 0 0 0 0 0
wser wmfg wfed wsta wloc mix pctymle
0 0 0 0 0 0 0
type d.f.
crmrte s 2
urban l 1
central l 1
west l 1
prbarr s 2
prbconv s 2
prbpris s 2
avgsen s 2
polpc s 2
density s 2
taxpc s 2
pctmin80 s 2
wcon s 2
wtuc s 2
wtrd s 2
wfir s 2
wser s 2
wmfg s 2
wfed s 2
wsta s 2
wloc s 2
mix s 2
pctymle s 2
Transformation of Target Variables Forced to be Linear
R-squares for Predicting Non-Missing Values for Each Variable
Using Last Imputations of Predictors
urban central west
0.961 0.900 0.951
df<-melt(t(impute_arg$imputed$central), value.name=c("central"))
options(repr.plot.width=4, repr.plot.height=2)
ggplot(data = df, aes(x = central)) + geom_histogram(bins=30)
options(repr.plot.width=NULL, repr.plot.height=NULL)
df<-melt(t(impute_arg$imputed$west), value.name=c("west"))
options(repr.plot.width=4, repr.plot.height=2)
ggplot(data = df, aes(x = west)) + geom_histogram(bins=30)
options(repr.plot.width=NULL, repr.plot.height=NULL)
df<-melt(t(impute_arg$imputed$urban), value.name=c("urban"))
options(repr.plot.width=4, repr.plot.height=2)
ggplot(data = df, aes(x = urban)) + geom_histogram(bins=30)
options(repr.plot.width=NULL, repr.plot.height=NULL)
The results confirm the county is urban. It is also highly probable that county 71 is not west and is most associated with central. After correcting our data for urban and west, let’s compare ‘central’ with ‘other’ to be certain we have the right region.
dfCrime$urban[which(dfCrime$county==71)]<-median(impute_arg$imputed$urban)
dfCrime$nonurban[which(dfCrime$county==71)]<-(1-median(impute_arg$imputed$urban))
dfCrime$west[which(dfCrime$county==71)]<-median(impute_arg$imputed$west)
impute_arg <- aregImpute(~ crmrte + urban + central + other +
prbarr + prbconv + prbpris + avgsen + polpc +
density + taxpc + pctmin80 + wcon + wtuc +
wtrd + wfir + wser + wmfg + wfed + wsta + wloc +
mix + pctymle, data = dfCrime, match="closest", n.impute = 100)
Iteration 1
Iteration 2
Iteration 3
Iteration 4
Iteration 5
Iteration 6
Iteration 7
Iteration 8
Iteration 9
Iteration 10
Iteration 11
Iteration 12
Iteration 13
Iteration 14
Iteration 15
Iteration 16
Iteration 17
Iteration 18
Iteration 19
Iteration 20
Iteration 21
Iteration 22
Iteration 23
Iteration 24
Iteration 25
Iteration 26
Iteration 27
Iteration 28
Iteration 29
Iteration 30
Iteration 31
Iteration 32
Iteration 33
Iteration 34
Iteration 35
Iteration 36
Iteration 37
Iteration 38
Iteration 39
Iteration 40
Iteration 41
Iteration 42
Iteration 43
Iteration 44
Iteration 45
Iteration 46
Iteration 47
Iteration 48
Iteration 49
Iteration 50
Iteration 51
Iteration 52
Iteration 53
Iteration 54
Iteration 55
Iteration 56
Iteration 57
Iteration 58
Iteration 59
Iteration 60
Iteration 61
Iteration 62
Iteration 63
Iteration 64
Iteration 65
Iteration 66
Iteration 67
Iteration 68
Iteration 69
Iteration 70
Iteration 71
Iteration 72
Iteration 73
Iteration 74
Iteration 75
Iteration 76
Iteration 77
Iteration 78
Iteration 79
Iteration 80
Iteration 81
Iteration 82
Iteration 83
Iteration 84
Iteration 85
Iteration 86
Iteration 87
Iteration 88
Iteration 89
Iteration 90
Iteration 91
Iteration 92
Iteration 93
Iteration 94
Iteration 95
Iteration 96
Iteration 97
Iteration 98
Iteration 99
Iteration 100
Iteration 101
Iteration 102
Iteration 103
impute_arg
Multiple Imputation using Bootstrap and PMM
aregImpute(formula = ~crmrte + urban + central + other + prbarr +
prbconv + prbpris + avgsen + polpc + density + taxpc + pctmin80 +
wcon + wtuc + wtrd + wfir + wser + wmfg + wfed + wsta + wloc +
mix + pctymle, data = dfCrime, n.impute = 100, match = "closest")
n: 90 p: 23 Imputations: 100 nk: 3
Number of NAs:
crmrte urban central other prbarr prbconv prbpris avgsen
0 0 1 1 0 0 0 0
polpc density taxpc pctmin80 wcon wtuc wtrd wfir
0 0 0 0 0 0 0 0
wser wmfg wfed wsta wloc mix pctymle
0 0 0 0 0 0 0
type d.f.
crmrte s 2
urban l 1
central l 1
other l 1
prbarr s 2
prbconv s 2
prbpris s 2
avgsen s 2
polpc s 2
density s 2
taxpc s 2
pctmin80 s 2
wcon s 2
wtuc s 2
wtrd s 2
wfir s 2
wser s 2
wmfg s 2
wfed s 2
wsta s 2
wloc s 2
mix s 2
pctymle s 2
Transformation of Target Variables Forced to be Linear
R-squares for Predicting Non-Missing Values for Each Variable
Using Last Imputations of Predictors
central other
0.937 0.971
df<-melt(t(impute_arg$imputed$central), value.name=c("central"))
options(repr.plot.width=4, repr.plot.height=2)
ggplot(data = df, aes(x = central)) + geom_histogram(bins=30)
options(repr.plot.width=NULL, repr.plot.height=NULL)
median(impute_arg$imputed$central)
[1] 0.5
df<-melt(t(impute_arg$imputed$other), value.name=c("other"))
options(repr.plot.width=4, repr.plot.height=2)
ggplot(data = df, aes(x = other)) + geom_histogram(bins=30)
options(repr.plot.width=NULL, repr.plot.height=NULL)
Assign our new values
dfCrime$central[which(dfCrime$county==71)]<-median(impute_arg$imputed$central)
dfCrime$other[which(dfCrime$county==71)]<-median(impute_arg$imputed$other)
Recode the categories for region and metro
dfCrime$region <- case_when (
(dfCrime$central ==0 & dfCrime$west ==0) ~ 0, #Eastern, Coastal, Other
(dfCrime$central ==0 & dfCrime$west ==1) ~ 1, #Western
(dfCrime$central ==1 & dfCrime$west ==0) ~ 2 #Central
)
dfCrime$regcode =
factor( dfCrime$region , levels = 0:2 , labels =
c( 'O',
'W',
'C' )
)
dfCrime$metro =
factor( dfCrime$urban , levels = 0:1 , labels =
c( 'Outside',
'Inside'
)
)
dfCrime %>%
filter(county == 71) %>%
select(county, west, central, urban, region, regcode, metro)
county west central urban region regcode metro
1 71 0 0.5 1 NA <NA> Inside
Let’s review our density numbers again by looking in more detail at its distribution.
#options(repr.plot.width=8, repr.plot.height=4)
ggplot(data = dfCrime, aes(x = density)) +
geom_histogram(bins=90)
We note that one of the counties has an extremely low density. Near zero.
dfCrime %>%
filter(density < 0.01)
county year crmrte prbarr prbconv prbpris avgsen polpc
1 173 87 0.0139937 0.530435 0.327869 0.15 6.64 0.00316379
density taxpc west central urban pctmin80 wcon wtuc
1 2.03422e-05 37.72702 1 0 0 0.253914 231.696 213.6752
wtrd wfir wser wmfg wfed wsta wloc mix
1 175.1604 267.094 204.3792 193.01 334.44 414.68 304.32 0.4197531
pctymle region regcode other nonurban metro
1 0.07462687 1 W 0 1 Outside
In review of the North Carolina county density data from 1985, the smallest population density in any county in North Carolina is 0.0952. This makes the density of 0.0000203422 for county 173 statistically impossible. It is miscoded.
http://ncosbm.s3.amazonaws.com/s3fs-public/demog/dens7095.xls
(Note to team: We could use this table if we want to assign names to our counties by comparing the population densities. What is interesting is that the 6 rows of missing values we removed earlier can be found in the tail of this table. There was an arbitrary cut off after a certain density - lkely because the counties were not statistically significant. County 173 is not one of those counties, however, as our imputation process will demonstrate.)
dfCrime$density[which(dfCrime$county==173)]<- NA
dfSubset <- dfCrime %>% filter(urban==0 & west ==1) #we will use the other non-urban western counties to find a replacement
impute_arg <- aregImpute(~ crmrte + # urban + central + west + other + # region variables become constant and no longer apply
prbarr + prbconv + prbpris + avgsen + polpc +
density + taxpc + pctmin80 + wcon + wtuc +
wtrd + wfir + wser + wmfg + wfed + wsta + wloc +
mix + pctymle, data = dfSubset, match="closest", n.impute = 30)
Iteration 1
Iteration 2
Iteration 3
Iteration 4
Iteration 5
Iteration 6
Iteration 7
Iteration 8
Iteration 9
Iteration 10
Iteration 11
Iteration 12
Iteration 13
Iteration 14
Iteration 15
Iteration 16
Iteration 17
Iteration 18
Iteration 19
Iteration 20
Iteration 21
Iteration 22
Iteration 23
Iteration 24
Iteration 25
Iteration 26
Iteration 27
Iteration 28
Iteration 29
Iteration 30
impute_arg
Multiple Imputation using Bootstrap and PMM
aregImpute(formula = ~crmrte + prbarr + prbconv + prbpris + avgsen +
polpc + density + taxpc + pctmin80 + wcon + wtuc + wtrd +
wfir + wser + wmfg + wfed + wsta + wloc + mix + pctymle,
data = dfSubset, n.impute = 30, match = "closest")
n: 20 p: 20 Imputations: 30 nk: 3
Number of NAs:
crmrte prbarr prbconv prbpris avgsen polpc density taxpc
0 0 0 0 0 0 1 0
pctmin80 wcon wtuc wtrd wfir wser wmfg wfed
0 0 0 0 0 0 0 0
wsta wloc mix pctymle
0 0 0 0
type d.f.
crmrte s 2
prbarr s 2
prbconv s 2
prbpris s 2
avgsen s 2
polpc s 2
density s 1
taxpc s 2
pctmin80 s 2
wcon s 2
wtuc s 2
wtrd s 2
wfir s 2
wser s 2
wmfg s 2
wfed s 2
wsta s 2
wloc s 2
mix s 2
pctymle s 2
Transformation of Target Variables Forced to be Linear
R-squares for Predicting Non-Missing Values for Each Variable
Using Last Imputations of Predictors
density
1
df<-melt(t(impute_arg$imputed$density), value.name=c("density"))
options(repr.plot.width=4, repr.plot.height=2)
ggplot(data = df, aes(x = density)) + geom_histogram(bins=30)
options(repr.plot.width=NULL, repr.plot.height=NULL)
dfCrime$density[which(dfCrime$county==173)]<-median(impute_arg$imputed$density)
dfCrime$density[which(dfCrime$county==173)]
[1] 1.505422
Now, we will examine histograms for the remaining variables
dfEconVars <- as.data.frame(cbind(dfCrime$wcon, dfCrime$wtuc, dfCrime$wtrd, dfCrime$wfir, dfCrime$wser, dfCrime$wmfg, dfCrime$wfed, dfCrime$wsta, dfCrime$wloc))
names(dfEconVars) <- c('wcon', 'wtuc', 'wtrd', 'wfir', 'wser',
'wmfg', 'wfed', 'wsta', 'wloc')
ggplot(melt(dfEconVars),aes(x=value)) + geom_histogram(bins=30) + facet_wrap(~variable)
No id variables; using all as measure variables
Each histogram for the wage information looks evenly distributed. We have no further remark at this time. We move to the justice an law enforcement variables. With these variables being mostly < 1 we’ll also take the log for comparison.
q1<-ggplot(data = dfCrime, aes(x = prbarr)) +
geom_histogram(bins=30)
q11<-ggplot(data = dfCrime, aes(x = log(prbarr))) +
geom_histogram(bins=30)
q2<-ggplot(data = dfCrime, aes(x = prbconv)) +
geom_histogram(bins=30)
q21<-ggplot(data = dfCrime, aes(x = log(prbconv))) +
geom_histogram(bins=30)
q3<-ggplot(data = dfCrime, aes(x = prbpris)) +
geom_histogram(bins=30)
q31<-ggplot(data = dfCrime, aes(x = log(prbpris))) +
geom_histogram(bins=30)
q4<-ggplot(data = dfCrime, aes(x = avgsen)) +
geom_histogram(bins=30)
q41<-ggplot(data = dfCrime, aes(x = log(avgsen))) +
geom_histogram(bins=30)
q5<-ggplot(data = dfCrime, aes(x = polpc)) +
geom_histogram(bins=30)
q51<-ggplot(data = dfCrime, aes(x = log(polpc))) +
geom_histogram(bins=30)
q6<-ggplot(data = dfCrime, aes(x = mix)) +
geom_histogram(bins=30)
q61<-ggplot(data = dfCrime, aes(x = log(mix))) +
geom_histogram(bins=30)
grid.arrange(q1, q11, q2, q21, q3, q31, q4, q41, q5, q51, q6, q61, ncol=2)
The log transformation for these variables makes them more evenly distributed. We will transform these variables to their log equivalents and confirm with plots to see whether the result shows more linearity.
#Plot of the criminal justice and law enforcment related variables vs crmrte
q1<-ggplot(data = dfCrime, aes(x = log(prbarr), y = crmrte, color = region)) +
geom_point()+
geom_smooth(method = "lm")
q2<-ggplot(data = dfCrime, aes(x = log(prbconv), y = crmrte, color = region)) +
geom_point()+
geom_smooth(method = "lm")
q3<-ggplot(data = dfCrime, aes(x = log(prbpris), y = crmrte, color = region)) +
geom_point()+
geom_smooth(method = "lm")
q4<-ggplot(data = dfCrime, aes(x = log(avgsen), y = crmrte, color = region)) +
geom_point()+
geom_smooth(method = "lm")
q5<-ggplot(data = dfCrime, aes(x = log(polpc), y = crmrte, color = region)) +
geom_point()+
geom_smooth(method = "lm")
q6<-ggplot(data = dfCrime, aes(x = log(mix), y = crmrte, color = region)) +
geom_point()+
geom_smooth(method = "lm")
grid.arrange(q1, q2, q3, q4, q5, q6, ncol=2)
Of the six variables, only prbarr, prbconv and polpc show univariate correlation with crime. We believe these will be better candidates for our model selection. Further, we see mix has no correlation with crmrate and may be its own outcome variable.
dfCrime$logprbarr <- log(dfCrime$prbarr)
dfCrime$logprbconv <- log(dfCrime$prbconv)
dfCrime$logprbpris <- log(dfCrime$prbpris)
dfCrime$logavgsen <- log(dfCrime$avgsen)
dfCrime$logpolpc <- log(dfCrime$polpc)
dfCrime$logmix <- log(dfCrime$mix)
Next we take a look at the demographic histograms and their log alternatives
q1<-ggplot(data = dfCrime, aes(x = pctymle)) +
geom_histogram(bins=30)
q11<-ggplot(data = dfCrime, aes(x = log(pctymle))) +
geom_histogram(bins=30)
q2<-ggplot(data = dfCrime, aes(x = pctmin80)) +
geom_histogram(bins=30)
q21<-ggplot(data = dfCrime, aes(x = log(pctmin80))) +
geom_histogram(bins=30)
q3<-ggplot(data = dfCrime, aes(x = density)) +
geom_histogram(bins=30)
q31<-ggplot(data = dfCrime, aes(x = log(density))) +
geom_histogram(bins=30)
grid.arrange(q1, q11, q2, q21, q3, q31, ncol=2)
The shape after transformation make the data more distributed. We will include transfomations of these variables as well.
dfCrime$logdensity <- log(dfCrime$density)
dfCrime$logtaxpc <- log(dfCrime$taxpc)
dfCrime$logpctmin80 <- log(dfCrime$pctmin80)
dfCrime$logpctymle <- log(dfCrime$pctymle)
Finally, we’ll take a look at taxpc and the crmrte variable itself.
q1<-ggplot(data = dfCrime, aes(x = taxpc)) +
geom_histogram(bins=30)
q11<-ggplot(data = dfCrime, aes(x = log(taxpc))) +
geom_histogram(bins=30)
q2<-ggplot(data = dfCrime, aes(x = crmrte)) +
geom_histogram(bins=30)
q21<-ggplot(data = dfCrime, aes(x = log(crmrte))) +
geom_histogram(bins=30)
grid.arrange(q1, q11, q2, q21, ncol=2)
The crmrte and taxpc variables are more evenly distributed after transformation. We’ll add those to our dataframe.
dfCrime$logcrmrte = log(dfCrime$crmrte)
dfCrime4logtaxpc = log(dfCrime$taxpc)
As a final point of discussion we will identify additional variables we wish to operationalize for use in our models. The include a variable that expresses the economic condition of the county and a variable that expresses criminal justice effectiveness.
The first variable on the economic condition will include the sum of all average weekly wages from the 1980 census information. Since we do not know how many were employed at that wage we use this summary the best available proxy.
dfCrime$allWages<-dfCrime$wcon + dfCrime$wtuc + dfCrime$wtrd + dfCrime$wfir +
dfCrime$wser + dfCrime$wmfg + dfCrime$wfed + dfCrime$wsta + dfCrime$wloc
As a second variable, we are interested in understanding the effectiveness of the criminal justice system as a crime deterrent. Our proxy will be the number of convictions per incident.
This is operationalized by taking the probability of arrests, pbrarr (which is defined as arrests per incident) and multiplying by the probability of convictions, pbrconv (which is defined as convictions per arrest). The new variable is defined below.
dfCrime$crimJustEff<-dfCrime$prbarr * dfCrime$prbconv
We will also create a logarithmic transformation of this variable based on our histogram analysis from before.
dfCrime$logcrimJustEff<-log10(dfCrime$crimJustEff)
Our outcome variable is the crime rate (“crmrte”), which is defined as the crimes committed per person in a specific county during 1987. The crime rate of the 90 counties in our sample dataset range between 0.0055 - 0.0990, with a mean of 0.0335.
From the boxplot below, most of the counties have a crime rate between 0.0055 and 0.0700, with 5 outliers having a crime rate > 0.0700.
options(repr.plot.width=4, repr.plot.height=4)
ggplot(data = dfCrime, aes(y = crmrte)) +
geom_boxplot()
While mix (the type of crime committed) is also potentially an outcome variable, our research focuses on providing policy recommendations to reduce crime in general and not a specific type of crime. Mix is also not a linear outcome and hence difficult to measure.
We propose 3 multiple linear regression models
First Model: Has only the explanatory variables of key interest and no other covariates.
Second Model: Includes the explanatory variables and covariates that increase the accuracy of our results without substantial bias.
Third Model: An expansion of the second model with most covariates, designed to demonstrate the robustness of our results to model specification.
As we proceed with each model, we verify the CLM assumptions of OLS are addressed below: * MLR1 Linear in parameters: The models have had its data transformed as described above to allow a linear fit of the model. * MLR2 Random Sampling: The data is collected from a data set with rolled up data for each county. It is not randomly sampled by area or population. * MLR3 No perfect multicollinearity: None of the variables chosen for the model are constant or perfectly collinear as demonstrated by the scatterplot below. * MLR4’ The expectation of u and and covariance of each regressor with u are ~0. This shows that our model’s regressors are exogenous with the error. * MLR5’ Spherical errors: There is homoscedasticity and no autocorrelation [TBD]. * MLR6’ Our error terms should be normally distributed [TBD].
By satisfying these assumptions, we can expect our coefficients will be approaching the true parameter values in probability.
options(repr.plot.width=8, repr.plot.height=8)
pairs(~ logcrimJustEff + logpolpc + allWages + logtaxpc + logdensity + logpctmin80 + logpctymle,data=dfCrime, main="Scatterplot Matrix of Model Variables")
Our base hypothesis is that crime can be fundamentally explained by two factors: the effectiveness of the criminal justice system and the economic conditions.
Criminal Justice Effectiveness is self defined : To be able to track crimes, they must be reported to police, who can then make arrests and the legal system provides judgement (convictions/sentencing) Criminal justice also has a relationship to crime as a deterrent, as the probability of getting caught, convicted, sentenced could potentially deter crime.
We operationalize criminal justice effectiveness as follows: probability of Convictions * Crimes committed. We define as: prbconv * prbarr = conv/arrest * arrest/crime = convictions/crime. Without more granular data, this provides a single parsimonious metric that helps understand how the law enforcement and criminal justice system works.
Data Transformations
options(repr.plot.width=4, repr.plot.height=4)
hist((dfCrime$prbconv))
hist((dfCrime$prbarr))
The distribution of both probability of conviction and probability of arrest are peculiar and non-normal. It could be argued that both of these variables should be bound between 0 and 1. However, “probability” of conviction is proxied by a ratio of convictions to arrests. It is in fact common that defendents are charged with multiple crimes and convicted, but were only arrested once.
For “probability” of arrest, it could be possible there are multiple arrests for a single offense. However, the single data point that is greater than one, is >3 standard deviations away from the distribution. This outlier will have high leverage on our model and will be preemptively removed as the data supplied is likely in error and is not representative of the bulk of North Carolina counties.
For parsimony, we can simply the probability of arrest and probability of conviction by multiplying to effectively get the ratio of convictions to offenses. The normality of this factor can be improved by taking a log transform. QQ plots help to visualize how normality improves for the inner quartiles.
(dfCrime[51,]$prbarr - mean(dfCrime$prbarr))/sd(dfCrime$prbarr) # how many standard deviations away the outlier lies
[1] 5.779438
#hist(log(dfCrime$crimJustEff))
ggplot(data=dfCrime, aes(sample= crimJustEff)) + stat_qq() + stat_qq_line() + ggtitle("QQ Plot of Crim Just Eff")
dfCrime <- dfCrime[dfCrime$crimJustEff < 1,] # removing high flying outlier
ggplot(data=dfCrime, aes(sample= crimJustEff)) + stat_qq() + stat_qq_line() + ggtitle("QQ Plot of Crim Just Eff")
ggplot(data=dfCrime, aes(sample= logcrimJustEff)) + stat_qq() + stat_qq_line() +
ggtitle("QQ Plot of log transformed Crim Just Eff")
## Can show histogram/qqplot side by side in RMD.
We theorize that the second major cause of crime are bad economic conditions. When there are worse economic conditions, crime can be more attractive due to:
We operationalize economic conditions by looking at wages. For this model, we define this as the sum of all average wages in each county. We think this is best proxy from our data because it answers all of the above (higher wages leads to better means and better opportunities). From our EDA we also confirm that in general these sums are not skewed by having 1 really high paying sector in each county as we see a strong relationship between avg quartile across all job types and total sum. This can be seen in the chart below.
# # Quantiles for all jobs
dfWage<-mutate(dfCrime,qCon=ntile(dfCrime$wcon,4))
dfWage<-mutate(dfWage,qTuc=ntile(dfCrime$wtuc,4))
dfWage<-mutate(dfWage,qTrd=ntile(dfCrime$wtrd,4))
dfWage<-mutate(dfWage,qFir=ntile(dfCrime$wfir,4))
dfWage<-mutate(dfWage,qSer=ntile(dfCrime$wser,4))
dfWage<-mutate(dfWage,qMfg=ntile(dfCrime$wmfg,4))
dfWage<-mutate(dfWage,qFed=ntile(dfCrime$wfed,4))
dfWage<-mutate(dfWage,qSta=ntile(dfCrime$wsta,4))
dfWage<-mutate(dfWage,qLoc=ntile(dfCrime$wloc,4))
## Average quantile
dfWage$qAvg= (dfWage$qCon+dfWage$qTuc+dfWage$qTrd+dfWage$qFir+dfWage$qSer+dfWage$qMfg+dfWage$qFed+dfWage$qSta+dfWage$qLoc)/9
plot(dfCrime$allWages,dfWage$qAvg)
hist(dfCrime$allWages)
ggplot(data=dfCrime, aes(sample= allWages)) + stat_qq() + stat_qq_line() + ggtitle("QQ Plot of sum of wages")
mod1 <- lm(dfCrime$logcrmrte ~ dfCrime$allWages + dfCrime$logcrimJustEff)
(mod1)
Call:
lm(formula = dfCrime$logcrmrte ~ dfCrime$allWages + dfCrime$logcrimJustEff)
Coefficients:
(Intercept) dfCrime$allWages dfCrime$logcrimJustEff
-6.3008940 0.0006381 -1.0014763
summary(mod1)$adj.r.square
[1] 0.4559566
## will be details on effect size and standard error as we cover this in class.
plot(mod1, which=5)
plot(mod1, which=2)
plot(mod1, which=3)
plot(mod1, which=1)
The model shows a moderate good fit, with an adjusted R square of 0.46. This can be interpreted as, the model explains 46% of the variation in crime. Next the model is plotted in a Residuals vs Leverage plot. This plot shows that all the points have a cook’s distance of less than 0.5. There are no points that have enough leverage and residual than when deleted greatly alter the model coefficients.
The root of standardized residuals all fall within about 1.6. This is very good, as we can expect 95% of the points to fall within 3 standardized residuals of each other. (\(\sqrt(3) \approx 1.73\))
Finally, the residuals vs fitted plot shows a well centered and mostly nromal distribution about 0. There are no major trends or variation changes across the fitted values. This suggests that major uncorrelated variables have not been left out of the model. We will discuss the possible ommited variable biases further, in the next sections.
Model 1 CLM Assumptions: [To be finalized] * MLR1 Linear in paramters: The model has had its data transformed as described above to allow a linear fit of the model. * MLR2 Random Sampling: The data is collected from a data set with rolled up data for each county. It is not randomly sampled by area or population. * MLR3 No perfect multicollinearity: None of the variables chosen for the model are constant or perfectly collinear as the economy and criminal justice effectiveness are independent. * MLR4’ The expectation of u and and covariance of each regressor with u are ~0. This shows that our model’s regressors are exogenous with the error.
By satisfying these assumptions, we can expect that our coefficients are approaching the true parameter values in probability.
##MLR 5,6 to be discussed in week 13…?
cov(resid(mod1), dfCrime$allWages)
[1] 2.778984e-14
cov(resid(mod1), log(dfCrime$crimJustEff))
[1] -2.644899e-18
mean(resid(mod1))
[1] -5.575418e-18
In this model, we introduce the additional covariates of population per square mile (density), tax per capita (taxpc) and police per capita (polpc) to increase the accuracy of our regression. We are including these additional variables to our second model, as they add accuracy to the explanatory variables used in our first model:
\[log(crmrate) = \beta_0 + \beta_1crimjusteff + \beta_2log(polpc) + \beta_3density + \beta_4allWages + \beta_5taxpc + u\]
corrplot(cor(dfCrime[,c("logcrmrte", "logcrimJustEff", "logprbarr", "logprbconv", "prbpris", "polpc", "taxpc", "allWages", "urban", "density", "pctymle", "pctmin80")]),method='circle', type = 'lower')
par(mfrow = c(2,2))
hist(dfCrime$polpc, breaks=25)
hist(dfCrime$taxpc, breaks=25)
hist(dfCrime$density, breaks=25)
par(mfrow = c(3,2))
hist(dfCrime$polpc, main="Hist of polpc")
hist(dfCrime$logpolpc, main="Hist of Log10 logpolpc")
hist(dfCrime$taxpc, main="Hist of taxpc")
hist(dfCrime$logtaxpc, main="Hist of Log10 logtaxpc")
hist(dfCrime$density, main="Hist of density")
hist(dfCrime$logdensity, main="Hist of Log10 logdensity")
# par(mfrow = c(2,2))
# plot(dfCrime$logcrimJustEff, dfCrime$polpc, main = 'polpc vs logcrimJustEff', xlab='logcrimJustEff', ylab='polpc')
# plot(dfCrime$logcrimJustEff, dfCrime$logpolpc, main = 'logpolpc vs logcrimJustEff', xlab='logcrimJustEff', ylab='logpolpc')
# plot(dfCrime$logcrimJustEff, dfCrime$taxpc, main = 'taxpc vs logcrimJustEff', xlab='logcrimJustEff', ylab='taxpc')
# plot(dfCrime$logcrimJustEff, dfCrime$logtaxpc, main = 'logtaxpc vs logcrimJustEff', xlab='logcrimJustEff', ylab='logtaxpc')
In the histograms above, we see that the both polpc and taxpc exhibit right skew. Taking the \(log_{10}\) of polpc brings the distribution closer to normal. However, the \(log\) of taxpc and density makes the distributions even more skewed.
As a result, we will use the \(log\) of polpc (logpolpc) in our second model and will not transform the taxpc and density variables.
model2 <- lm(logcrmrte ~ logcrimJustEff + logpolpc + allWages + taxpc + density, data = dfCrime)
model2
Call:
lm(formula = logcrmrte ~ logcrimJustEff + logpolpc + allWages +
taxpc + density, data = dfCrime)
Coefficients:
(Intercept) logcrimJustEff logpolpc allWages
-2.2783403 -0.6345041 0.3984752 0.0002599
taxpc density
-0.0043953 0.1118637
summary(model2)
Call:
lm(formula = logcrmrte ~ logcrimJustEff + logpolpc + allWages +
taxpc + density, data = dfCrime)
Residuals:
Min 1Q Median 3Q Max
-1.11981 -0.17313 -0.02517 0.27332 0.65535
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.2783403 1.1634826 -1.958 0.053565 .
logcrimJustEff -0.6345041 0.1810816 -3.504 0.000741 ***
logpolpc 0.3984752 0.1396949 2.852 0.005474 **
allWages 0.0002599 0.0001548 1.679 0.096908 .
taxpc -0.0043953 0.0045553 -0.965 0.337412
density 0.1118637 0.0379080 2.951 0.004117 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.3541 on 83 degrees of freedom
Multiple R-squared: 0.5669, Adjusted R-squared: 0.5408
F-statistic: 21.73 on 5 and 83 DF, p-value: 7.614e-14
The Adjusted R-squared variable penalizes for additional variables, which means there is a chance that this value will decrease if the added variables do not contribute to the model. By comparing the Adjusted R-squared value between our first and second models, we see that log(polpc), taxpc and density help describe log(crmrate). Our second model has an Adjusted R-squared value of 0.5004, which means 50.04% of the variation in the \(log_{10}\) of crime rate is explained by the explanatory variables used in this model. This is a significant increase compared to our first model, that has an Adjusted R-squared value of 0.4520.
In addition, the F-statistic is 16.62 with a statistically significant p-value of < 6.263e-11. As a result, we reject the null hypothesis that none of the independent variables help to describe log(crmrate).
Coefficient Analysis (assuming ceterus paribus): - logcrimJustEff: -0.1607. This suggests that for a 1% increase in criminal justice efficiency, there is a 0.1607% decrease in crime rate. - logpolpc: 0.3701. This suggests that for a 1% increase in police per capita, there is a 0.3701% increase in crime rate. - allWages: 0.00006692. This suggests that for a 1% increase in total average weekly wage, there is a 0.0067% increase in crime rate. - taxpc: -0.001632. This suggests that for a 1% increase in tax per capita, there is a 0.1632% decrease in crime rate. - density: 0.06259. This suggests that for a 1% increase in density, there is a 6.259% increase in crime rate.
Model 2 CLM Assumptions: [To be Finalized] * MLR1 Linear in paramters: The model has had its data transformed as described above to allow a linear fit of the model. * MLR2 Random Sampling: The data is collected from a data set with rolled up data for each county. It is not randomly sampled by area or population. * MLR3 No perfect multicollinearity: None of the variables chosen for the model are constant or perfectly collinear as the economy and criminal justice effectiveness are independent. * MLR4’ The expectation of u and and covariance of each regressor with u are ~0. This shows that our model’s regressors are exogenous with the error.
By satisfying these assumptions, we can expect that our coefficients are approaching the true parameter values in probability.
##MLR 5,6 to be discussed in week 13…?
Compared to model 1, the adjusted \(R^2\) of model 2 is only marginally higher. This suggests that we should continue our analysis by focusing on the join significance of the variables added in model 2.
Despite the improvements in the accuracy of model 2 over model 1, we are still only explaining about 55% of the variation in our data. As a result, we propose to also analyse the topic of demographics which could have an effec on both of our key explanatory variables.
One key component of demographics is the race of the county inhabitants and how they are perceived and treated by others, especially for minorities in the population. For example, systemic racism could have an important effect on: * Criminal Justice Effectiveness: If police, lawyers and judges are racially biased, this could lead to more arrests and more convictions regardless of the strength of the legal case and the evidence. As a result, we hypothesize the crime rate would increase. * Economic Opportunity: Racism could prohibit members of the minority from having access to education, jobs and higher wages. Racism could also limit access to healthcare and social programmes which has a negative effect on economic opportunity.
However, since we cannot directly measure racism, we have to operationalize this covariate by examining its effect in the real world. We propose to use the variable pctmin80, which represents the percentage of minorities in the population of the county. This is a good indicator that is also a linear parameter: given a higher the percentage of minorities, we should expect to see a greater effect.
We propose to operationalize gender and age with the variable
We have also chosen not to include other variables from our dataset in our model: * Region: While geographical indicators are also important, particularly as they may represent clusters of jobs and skilled workers, it is not a linear parameter (i.e. we can not simply increase a region by “1” and expect to see an effect on the crime rate.“) * Urban: We believe the variable”density" better explains the same effects as “urban”, while also being a linear parameter. In addition, there may be data points that failed to meet the cutoff for being defined as urban, but may still see the same effects as being urban and hence may distort our analysis. * Age and Gender: While age and gender are important demographic variables, the only variable in our dataset is pctymle which provides the percentage of young males in the population. However, given that this variable encompasses both male and young, we may not be able to discern if age or gender has the larger effect (if any at all).
Percentage Minority: From the summary and boxplot below, we can see that the percentage of minorities ranges from 0.0154 - 0.6435, with a mean of 0.2621. We note that there are no major outliers.
In addition from the scatterplots below, we see that using applying log10 on pctmin80 exposes a more linear relationship with the points more balanced on either side of the trendline. As a result, we will use the log-transformed version of pctmin80.
summary(dfCrime$pctmin80)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.01541 0.10084 0.25391 0.25987 0.38223 0.64348
boxplot(dfCrime$pctmin80)
plot(dfCrime$pctmin80, dfCrime$logcrmrte)
abline(lm(dfCrime$logcrmrte~dfCrime$pctmin80))
plot(dfCrime$logpctmin80, dfCrime$logcrmrte)
abline(lm(dfCrime$logcrmrte~dfCrime$logpctmin80))
model3<-lm(logcrmrte ~ logcrimJustEff + logpolpc + allWages + taxpc + density + logpctmin80 , data = dfCrime)
summary(model3)
Call:
lm(formula = logcrmrte ~ logcrimJustEff + logpolpc + allWages +
taxpc + density + logpctmin80, data = dfCrime)
Residuals:
Min 1Q Median 3Q Max
-0.79245 -0.12650 0.01757 0.14956 0.68936
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.0447465 0.9226341 -2.216 0.029448 *
logcrimJustEff -0.8668274 0.1472063 -5.889 8.24e-08 ***
logpolpc 0.4066755 0.1107124 3.673 0.000426 ***
allWages 0.0003211 0.0001230 2.611 0.010732 *
taxpc -0.0081348 0.0036484 -2.230 0.028503 *
density 0.0899602 0.0302004 2.979 0.003806 **
logpctmin80 0.2399710 0.0338834 7.082 4.44e-10 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.2806 on 82 degrees of freedom
Multiple R-squared: 0.7313, Adjusted R-squared: 0.7116
F-statistic: 37.2 on 6 and 82 DF, p-value: < 2.2e-16
The model shows a good fit, with an adjusted R-squared of 0.7322, meaning that the model explains 73% of the variation in crime.
For all of our 6 different independent variables, we note each of them have statistical significance at the 95% level or better. Of these 6, criminal justice efficiency, minority percentages and density are the most significant.
Interpretation of coefficients (Assuming ceterus paribus):
Positive coefficients: * Police presence: If we increase police per capita by 1 unit, we expect the crime rate to increase by 33%. * AllWages: If we increase wages by 1 dollar, we expect the crime rate to increase by 0.01% * Density: If we increase density by 1 unit, we expect the crime rate to increase by 5% * Percentage of minorities: If the percentage of minorities increase by 1%, we expect the crime rate to increase by 0.24%
Negative coefficients: * Criminal justice efficiency: If we increase the criminal justice efficiency by 1%, we expect the crime rate to decrease by 0.34%. * Tax per capita: If we increase tax per capita by 1 unit, we expect the crime rate to decrease by 0.35%
In addition, the F-statistic is 40.04 with a statistically significant p-value of < 2.2e-11. As a result, we reject the null hypothesis that none of the independent variables help to describe log(crmrate).
[Comment on standard error]
In the Residuals vs Leverage plot below, all the points have a cook’s distance of less than 0.5. While there is a point with 0.6 leverage, there are no points that have residual that greatly alter the model coefficients.
The root of standardized residuals all fall within about 1.6. This is very good, as we can expect 95% of the points to fall within 3 standardized residuals of each other. ( (⎯⎯√3)≈1.73 )
Finally, the residuals vs fitted plot shows a well centered and mostly normal distribution about 0. There are no major trends or variation changes across the fitted values. This suggests that major uncorrelated variables have not been left out of the model.
plot(model3, which = 5)
plot(mod1, which=3)
plot(model3, which=1)
Model 3 CLM Assumptions: [To be finalized] * MLR1 Linear in paramters: The model has had its data transformed as described above to allow a linear fit of the model. * MLR2 Random Sampling: The data is collected from a data set with rolled up data for each county. It is not randomly sampled by area or population. * MLR3 No perfect multicollinearity: None of the variables chosen for the model are constant or perfectly collinear with each other. Our new variables for percentage minority and percentage young males may have some relationships with the other variables but they are not perfectly colinear as noted from the scatterplot matrix in our EDA. * MLR4’ The expectation of u and and covariance of each regressor with u are ~0. This shows that our model’s regressors are exogenous with the error.
By satisfying these assumptions, we can expect that our coefficients are approaching the true parameter values in probability.
##MLR 5,6 to be discussed in week 13…?
*Can anyone figure out why logcrimJustEff is on 2 lines?
stargazer(mod1,model2,model3,type="text")
========================================================================================
Dependent variable:
--------------------------------------------------------------------
logcrmrte logcrmrte
(1) (2) (3)
----------------------------------------------------------------------------------------
allWages 0.001***
(0.0001)
logcrimJustEff -1.001***
(0.174)
logcrimJustEff -0.635*** -0.867***
(0.181) (0.147)
logpolpc 0.398*** 0.407***
(0.140) (0.111)
allWages 0.0003* 0.0003**
(0.0002) (0.0001)
taxpc -0.004 -0.008**
(0.005) (0.004)
density 0.112*** 0.090***
(0.038) (0.030)
logpctmin80 0.240***
(0.034)
Constant -6.301*** -2.278* -2.045**
(0.374) (1.163) (0.923)
----------------------------------------------------------------------------------------
Observations 89 89 89
R2 0.468 0.567 0.731
Adjusted R2 0.456 0.541 0.712
Residual Std. Error 0.385 (df = 86) 0.354 (df = 83) 0.281 (df = 82)
F Statistic 37.876*** (df = 2; 86) 21.732*** (df = 5; 83) 37.195*** (df = 6; 82)
========================================================================================
Note: *p<0.1; **p<0.05; ***p<0.01
Comparing the 3 models, we see that our adjusted R2 value has steadily increased from 0.456-0.732 as we introduce more covariates which indicates that we were able to explain more variation in our model not purely by increasing the number of indepedent variables.
At the same time, our standard errors have decreased insert more commentary on standard errors.
We see that by expanding our definitions of criminal justice efficiency and economic opportunity between model 1 and model 3 lowered the coefficients for logcrimJustEff and allWages. This is most likely because that we were able to better explain the effects with our newer variables.
Comment on practical significance after week 12
Given that across all 3 models, we show that both criminal justice efficiency and tax revenues per capita have negative correlations to crime rate, we propose the policy recommendations below to address these issues. In addition, since minority percentages and density were found to be highly significant in the model 3, we believe our recommendations will be of particularly help to those running for political office in counties with a high percentage of minorities or dense urban populations.
Since increasing both criminal justice and tax revenues are negatively correlated, we propose providing more funding for the local justice system.
While increasing taxes on constituents may be difficult politically and may cost candidates the ballot, candidates can instead try to attract investment to bring more jobs with higher wages so you can increase revenues.
Candidates can also propose to levy taxes on things that could lead to crimes or violence such as alcohol and weapons.
Given the significance and relatively large coefficient size of percentage minority, candidates should enroll local law enforcement into bias training.
| Expected correlation between omitted and included variables |
|---|
| Omitted Variable | Crime Rate (\(B_k\)) | Criminal Justice Effectiveness | Economic Conditions |
|---|---|---|---|
| Education | - | unknown | + |
| Social Services | - | unknown | unknown |
| Unemployment | + | unknown | - |
| Gang Activity | + | - | - |
The 4 major identified ommited variables are shown above. * Education is an important variable because of demographic insights it provides. First, adults with higher education are less likely to participate in Crime and are more likely to have better economic opportunity. Second, a strong school system is also likely correlated with less youth crime. Because of these expected correlations we are likely overestimating the economic conditions coefficient estimate. * Available Social Services could also lower crime. Citizens with strong social services support have more options to get help when they lack means for purchasing basic life needs. However this is more difficult to predict, as some social service projects, like homeless shelters, could lead to more criminal activity. * Unemployment is used as an important indicator of economic health and opportunity. This is would be highly correlated to economic conditions variables like sum of wages. This indicator variable if added to the model would decrease the magnitude of the sum of wage means coefficient estimate.
* Gang or Organized Crime is special case of crime that contains unique causes. It is expected that it would be negatively correlated with criminal justice effectiveness as large social pressures prevent witnesses from supporting prosecution. Gang crime is also negatively correlated with economic conditions. From these assumed correlations, we can say that criminal justice effectiveness and economic conditions are both underestimated compared to including gang activity operationalized variable in the model.
We have shown in this report 3 different models that seek to explain and model changes in the crime rate in North Carolina in 1980. We start with the fundamental premise that crime is caused by both criminal justice efficiency and economic conditions, and further develop our definition of these two key explanatory variables which each new model.
In Model 3, we were able to explain up to 73% of the variation in our data, and found statistical significance at the 95% level or better for each of our covariates. Of these, we believe that increasing the efficiency of the criminal justice system and tax revenues were the most important, particularly for counties with high density and minority populations. However, our findings should be noted with caution as we were unable to study the effect of several ommitted variables including education, availability of social services, unemployment rates and the presence of organized crime. Had we been able to collect data on these variables and apply them in our model, we believe we could increase accuracy without bias.
#for the metro areas:
#increases in wsta and wtuc correlate with less crime
#increases in wfed and wser correlate with the same crime (ceterus paribus)
#increases in wcon, wtrd, wfir, wmfg and wloc correlate with more crime
#for the non-metro areas
#no increase in wage correlates with crime rates going down. They all go up.
# fit parallel slopes
mod<-lm(logcrmrte~wsta*metro, data=dfCrime)
#mod
# Augment the model
augmented_mod <- augment(mod)
#glimpse(augmented_mod)
# scatterplot, with color
data_space <- ggplot(augmented_mod, aes(x = wsta, y = logcrmrte, color = metro)) +
geom_point()
# single call to geom_line()
data_space +
geom_line(aes(y = .fitted))
# interaction plot
ggplot(dfCrime, aes(y = crmrte, x = wsta, color = metro)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE)
# fit parallel slopes
mod<-lm(crmrte~wsta*metro, data=dfCrime)
#mod
# Augment the model
augmented_mod <- augment(mod)
#glimpse(augmented_mod)
# scatterplot, with color
data_space <- ggplot(augmented_mod, aes(x = wsta, y = crmrte, color = metro)) +
geom_point()
# single call to geom_line()
data_space +
geom_line(aes(y = .fitted))
slr <- ggplot(dfCrime, aes(y = crmrte, x = wsta)) +
geom_point() +
geom_smooth(method = "lm", se = 0)
# model with one slope
lm(crmrte ~ wsta, data = dfCrime)
Call:
lm(formula = crmrte ~ wsta, data = dfCrime)
Coefficients:
(Intercept) wsta
3.617e-03 8.439e-05
# plot with two slopes
slr + aes(color = metro)
newmod<-lm(crmrte ~ wsta + metro + other + prbarr + wsta*metro + prbarr*other, data = dfCrime)
summary(newmod)
Call:
lm(formula = crmrte ~ wsta + metro + other + prbarr + wsta *
metro + prbarr * other, data = dfCrime)
Residuals:
Min 1Q Median 3Q Max
-0.026251 -0.008248 -0.000030 0.006076 0.043253
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.203e-02 1.579e-02 2.028 0.0458 *
wsta 1.313e-05 3.755e-05 0.350 0.7274
metroInside 1.068e-01 5.064e-02 2.109 0.0380 *
other 1.379e-02 8.507e-03 1.621 0.1089
prbarr -3.558e-02 1.892e-02 -1.881 0.0636 .
wsta:metroInside -1.780e-04 1.284e-04 -1.386 0.1695
other:prbarr -1.570e-02 2.761e-02 -0.569 0.5710
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.01344 on 82 degrees of freedom
Multiple R-squared: 0.5212, Adjusted R-squared: 0.4862
F-statistic: 14.88 on 6 and 82 DF, p-value: 1.968e-11
coeftest(newmod, vcov=vcovHC)
t test of coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.2027e-02 2.0230e-02 1.5831 0.11725
wsta 1.3132e-05 5.2066e-05 0.2522 0.80150
metroInside 1.0682e-01 9.2894e-02 1.1500 0.25350
other 1.3787e-02 8.8453e-03 1.5586 0.12294
prbarr -3.5584e-02 1.9880e-02 -1.7899 0.07716 .
wsta:metroInside -1.7803e-04 2.2446e-04 -0.7931 0.42999
other:prbarr -1.5704e-02 2.3876e-02 -0.6578 0.51253
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
newmod<-lm(crmrte ~ regcode:log(allWages) + metro:(logcrimJustEff) , data = dfCrime)
summary(newmod)
Call:
lm(formula = crmrte ~ regcode:log(allWages) + metro:(logcrimJustEff),
data = dfCrime)
Residuals:
Min 1Q Median 3Q Max
-0.031821 -0.005497 -0.000697 0.004864 0.035446
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.384045 0.103166 -3.723 0.000360 ***
regcodeO:log(allWages) 0.049436 0.013005 3.801 0.000276 ***
regcodeW:log(allWages) 0.047364 0.012986 3.647 0.000464 ***
regcodeC:log(allWages) 0.048438 0.012890 3.758 0.000320 ***
metroOutside:logcrimJustEff -0.031162 0.005163 -6.036 4.39e-08 ***
metroInside:logcrimJustEff -0.049307 0.005306 -9.293 1.88e-14 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.01063 on 82 degrees of freedom
(1 observation deleted due to missingness)
Multiple R-squared: 0.6967, Adjusted R-squared: 0.6782
F-statistic: 37.66 on 5 and 82 DF, p-value: < 2.2e-16
newmod %>% augment
# A tibble: 88 x 13
.rownames crmrte regcode log.allWages. metro logcrimJustEff .fitted
<chr> <dbl> <fct> <dbl> <fct> <dbl> <dbl>
1 1 0.0356 C 8.02 Outs… -0.803 0.0297
2 2 0.0153 C 7.88 Outs… -0.709 0.0199
3 3 0.0130 W 7.85 Outs… -0.924 0.0163
4 4 0.0268 C 7.95 Outs… -0.717 0.0232
5 5 0.0106 W 7.92 Outs… -0.607 0.0101
6 6 0.0146 W 7.86 Outs… -1.45 0.0332
7 7 0.0296 O 7.93 Outs… -0.721 0.0302
8 8 0.0203 O 7.90 Outs… -0.521 0.0225
9 9 0.0304 O 7.84 Outs… -0.959 0.0336
10 10 0.0222 O 8.05 Outs… -0.700 0.0360
# … with 78 more rows, and 6 more variables: .se.fit <dbl>, .resid <dbl>,
# .hat <dbl>, .sigma <dbl>, .cooksd <dbl>, .std.resid <dbl>
newmod<-lm(crmrte ~ regcode:log(allWages) + metro + logcrimJustEff + metro*logcrimJustEff, data = dfCrime)
summary(newmod)
Call:
lm(formula = crmrte ~ regcode:log(allWages) + metro + logcrimJustEff +
metro * logcrimJustEff, data = dfCrime)
Residuals:
Min 1Q Median 3Q Max
-0.031480 -0.005701 -0.000744 0.004381 0.035698
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.395122 0.104840 -3.769 0.000310 ***
metroInside -0.028355 0.042496 -0.667 0.506508
logcrimJustEff -0.030767 0.005214 -5.901 8.05e-08 ***
regcodeO:log(allWages) 0.050876 0.013226 3.847 0.000238 ***
regcodeW:log(allWages) 0.048818 0.013212 3.695 0.000398 ***
regcodeC:log(allWages) 0.049863 0.013109 3.804 0.000275 ***
metroInside:logcrimJustEff -0.041447 0.035154 -1.179 0.241844
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.01066 on 81 degrees of freedom
(1 observation deleted due to missingness)
Multiple R-squared: 0.6983, Adjusted R-squared: 0.676
F-statistic: 31.25 on 6 and 81 DF, p-value: < 2.2e-16
newmod<-lm(crmrte ~ regcode:log(allWages) + metro*logcrimJustEff, data = dfCrime)
summary(newmod)
Call:
lm(formula = crmrte ~ regcode:log(allWages) + metro * logcrimJustEff,
data = dfCrime)
Residuals:
Min 1Q Median 3Q Max
-0.031480 -0.005701 -0.000744 0.004381 0.035698
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.395122 0.104840 -3.769 0.000310 ***
metroInside -0.028355 0.042496 -0.667 0.506508
logcrimJustEff -0.030767 0.005214 -5.901 8.05e-08 ***
regcodeO:log(allWages) 0.050876 0.013226 3.847 0.000238 ***
regcodeW:log(allWages) 0.048818 0.013212 3.695 0.000398 ***
regcodeC:log(allWages) 0.049863 0.013109 3.804 0.000275 ***
metroInside:logcrimJustEff -0.041447 0.035154 -1.179 0.241844
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.01066 on 81 degrees of freedom
(1 observation deleted due to missingness)
Multiple R-squared: 0.6983, Adjusted R-squared: 0.676
F-statistic: 31.25 on 6 and 81 DF, p-value: < 2.2e-16
crmrte + prbarr + prbconv + prbpris + avgsen + polpc + density + taxpc + pctmin80 + wcon + wtuc + wtrd + wfir + wser + wmfg + wfed + wsta + wloc + mix + pctymle
#Other
options(repr.plot.width=10, repr.plot.height=10)
myData<-dfCrime %>% filter(region==0)
myData<-myData[, c("crmrte", "prbarr", "prbconv", "prbpris", "avgsen", "polpc", "density", "taxpc",
"pctmin80", "wcon", "wtuc", "wtrd", "wfir", "wser", "wmfg", "wfed", "wsta", "wloc",
"mix", "pctymle")]
myData %>% correlate() %>% network_plot()
Correlation method: 'pearson'
Missing treated using: 'pairwise.complete.obs'
ggcorr(myData, nbreaks=8, palette='RdBu', label=TRUE, label_size=5, label_color='white')
#Western
myData<-dfCrime %>% filter(region==1)
myData<-myData[, c("crmrte", "prbarr", "prbconv", "prbpris", "avgsen", "polpc", "density", "taxpc",
"pctmin80", "wcon", "wtuc", "wtrd", "wfir", "wser", "wmfg", "wfed", "wsta", "wloc",
"mix", "pctymle")]
myData %>% correlate() %>% network_plot()
Correlation method: 'pearson'
Missing treated using: 'pairwise.complete.obs'
ggcorr(myData, nbreaks=8, palette='RdBu', label=TRUE, label_size=5, label_color='white')
#Central
myData<-dfCrime %>% filter(region==2)
myData<-myData[, c("crmrte", "prbarr", "prbconv", "prbpris", "avgsen", "polpc", "density", "taxpc",
"pctmin80", "wcon", "wtuc", "wtrd", "wfir", "wser", "wmfg", "wfed", "wsta", "wloc",
"mix", "pctymle")]
myData %>% correlate() %>% network_plot()
Correlation method: 'pearson'
Missing treated using: 'pairwise.complete.obs'
ggcorr(myData, nbreaks=8, palette='RdBu', label=TRUE, label_size=5, label_color='white')
#metro = urban
myData<-dfCrime %>% filter(urban==1)
myData<-myData[, c("crmrte", "prbarr", "prbconv", "prbpris", "avgsen", "polpc", "density", "taxpc",
"pctmin80", "wcon", "wtuc", "wtrd", "wfir", "wser", "wmfg", "wfed", "wsta", "wloc",
"mix", "pctymle")]
myData %>% correlate() %>% network_plot()
Correlation method: 'pearson'
Missing treated using: 'pairwise.complete.obs'
ggcorr(myData, nbreaks=8, palette='RdBu', label=TRUE, label_size=5, label_color='white')
#metro = non-urban
myData<-dfCrime %>% filter(urban==0)
myData<-myData[, c("crmrte", "prbarr", "prbconv", "prbpris", "avgsen", "polpc", "density", "taxpc",
"pctmin80", "wcon", "wtuc", "wtrd", "wfir", "wser", "wmfg", "wfed", "wsta", "wloc",
"mix", "pctymle")]
myData %>% correlate() %>% network_plot()
Correlation method: 'pearson'
Missing treated using: 'pairwise.complete.obs'
ggcorr(myData, nbreaks=8, palette='RdBu', label=TRUE, label_size=5, label_color='white')
#metro = urban
myData<-dfCrime %>% filter(urban==1)
myData<-myData[, c("crmrte", "prbarr", "prbconv", "prbpris", "avgsen", "polpc", "density", "taxpc",
"pctmin80", "wcon", "wtuc", "wtrd", "wfir", "wser", "wmfg", "wfed", "wsta", "wloc",
"mix", "pctymle")]
#More correlation charts -
chart.Correlation(myData, histogram=TRUE, pch=19)
pairs.panels(myData, scale=TRUE)
ggpairs(myData)
ggcorrplot(cor(myData), p.mat = cor_pmat(myData), hc.order=TRUE, type='lower')
corrplot(cor(myData),method='circle', type = 'lower')
myData<-dfCrime[, c("county", "urban", "nonurban", "west", "central", "other", "density", "crmrte", "prbarr", "prbconv", "prbpris", "avgsen", "polpc", "taxpc",
"pctmin80", "wcon", "wtuc", "wtrd", "wfir", "wser", "wmfg", "wfed", "wsta", "wloc",
"mix", "pctymle")]
myData %>% arrange(desc(density))
county urban nonurban west central other density crmrte prbarr
1 119 1 0 0 1.0 0 8.8276520 0.0989659 0.149094
2 67 1 0 0 1.0 0 6.4271846 0.0614177 0.217215
3 129 1 0 0 0.0 1 6.2864866 0.0834982 0.236601
4 63 1 0 0 1.0 0 5.6744967 0.0706599 0.133225
5 81 1 0 0 1.0 0 5.1244240 0.0604498 0.300215
6 71 1 0 0 0.5 0 4.8347340 0.0544061 0.243119
7 183 1 0 0 1.0 0 4.3887587 0.0568423 0.204216
8 51 1 0 0 0.0 1 3.9345510 0.0883849 0.155248
9 35 0 1 0 1.0 0 2.9242425 0.0408569 0.266026
10 21 1 0 1 0.0 0 2.6024280 0.0437355 0.234760
11 25 0 1 0 1.0 0 2.5741758 0.0302542 0.323548
12 1 0 1 0 1.0 0 2.4226327 0.0356036 0.298270
13 57 0 1 0 1.0 0 2.2518249 0.0300216 0.222002
14 135 0 1 0 1.0 0 2.1575000 0.0628972 0.092770
15 159 0 1 0 1.0 0 2.0192678 0.0362330 0.243590
16 45 0 1 0 1.0 0 1.8440171 0.0362807 0.202627
17 89 0 1 1 0.0 0 1.8155080 0.0283836 0.296646
18 191 0 1 0 0.0 1 1.7725632 0.0458895 0.172257
19 195 0 1 0 0.0 1 1.7459893 0.0313973 0.201397
20 133 0 1 0 0.0 1 1.6500655 0.0551287 0.266960
21 105 0 1 0 1.0 0 1.5984555 0.0514152 0.381400
22 109 0 1 0 1.0 0 1.5939597 0.0230995 0.162769
23 181 0 1 0 1.0 0 1.5702811 0.0729479 0.182590
24 97 0 1 0 1.0 0 1.5662020 0.0334506 0.302231
25 147 0 1 0 0.0 1 1.5159818 0.0551686 0.221542
26 23 0 1 1 0.0 0 1.5119047 0.0269836 0.289121
27 157 0 1 0 1.0 0 1.5096661 0.0305908 0.278287
28 173 0 1 1 0.0 0 1.5054216 0.0139937 0.530435
29 107 0 1 0 0.0 1 1.5024875 0.0497552 0.212895
30 27 0 1 1 0.0 0 1.4989384 0.0382489 0.268018
31 127 0 1 0 0.0 1 1.3388889 0.0291496 0.179616
32 139 0 1 0 0.0 1 1.3333334 0.0243470 0.522696
33 179 0 1 0 1.0 0 1.2816901 0.0318720 0.377543
34 167 0 1 0 1.0 0 1.2752526 0.0238285 0.362270
35 151 0 1 0 1.0 0 1.2737643 0.0264557 0.299198
36 65 0 1 0 0.0 1 1.1679842 0.0658801 0.287330
37 171 0 1 1 0.0 0 1.1521336 0.0243954 0.175649
38 49 0 1 0 0.0 1 1.1440799 0.0374979 0.264420
39 155 0 1 0 0.0 1 1.1296101 0.0312279 0.408200
40 189 0 1 1 0.0 0 1.1019108 0.0313130 0.161381
41 85 0 1 0 0.0 1 1.0815308 0.0490712 0.146132
42 165 0 1 0 0.0 1 1.0783699 0.0508341 0.340679
43 3 0 1 0 1.0 0 1.0463320 0.0152532 0.132029
44 59 0 1 0 1.0 0 1.0262172 0.0233327 0.227753
45 161 0 1 1 0.0 0 1.0052817 0.0200070 0.482425
46 101 0 1 0 0.0 1 0.9962264 0.0409403 0.149936
47 153 0 1 0 1.0 0 0.9622641 0.0317563 0.345368
48 197 0 1 1 0.0 0 0.8898810 0.0141928 0.207595
49 87 0 1 1 0.0 0 0.8648649 0.0286781 0.215108
50 111 0 1 1 0.0 0 0.8306636 0.0183048 0.202112
51 125 0 1 0 1.0 0 0.8202568 0.0266287 0.269043
52 193 0 1 1 0.0 0 0.8138298 0.0235277 0.266055
53 169 0 1 0 1.0 0 0.8008850 0.0121033 0.343387
54 83 0 1 0 0.0 1 0.7817680 0.0315752 0.456394
55 145 0 1 0 1.0 0 0.7788945 0.0299856 0.354733
56 41 0 1 0 0.0 1 0.7417582 0.0257713 0.307246
57 77 0 1 0 1.0 0 0.7172285 0.0441957 0.190876
58 69 0 1 0 1.0 0 0.7125506 0.0173158 0.283505
59 175 0 1 1 0.0 0 0.6878307 0.0164932 0.350348
60 91 0 1 0 0.0 1 0.6713483 0.0384263 0.343074
61 79 0 1 0 0.0 1 0.6203008 0.0156759 0.411538
62 11 0 1 1 0.0 0 0.6113361 0.0146067 0.524664
63 93 0 1 0 0.0 1 0.6112532 0.0362222 0.338902
64 149 0 1 1 0.0 0 0.6092437 0.0164987 0.271967
65 117 0 1 0 0.0 1 0.5813449 0.0268723 0.370474
66 19 0 1 0 0.0 1 0.5767442 0.0221567 0.162860
67 47 0 1 0 0.0 1 0.5639659 0.0313623 0.182927
68 99 0 1 1 0.0 0 0.5478615 0.0171865 0.153846
69 9 0 1 1 0.0 0 0.5469484 0.0106232 0.518219
70 163 0 1 0 0.0 1 0.5353749 0.0215728 0.310987
71 53 0 1 0 0.0 1 0.5351562 0.0140655 0.303191
72 33 0 1 0 1.0 0 0.5257009 0.0159189 0.270950
73 13 0 1 0 0.0 1 0.5169492 0.0296409 0.365004
74 37 0 1 0 1.0 0 0.5127119 0.0226017 0.321867
75 55 0 1 0 0.0 1 0.5115089 0.0790163 0.224628
76 61 0 1 0 0.0 1 0.5079365 0.0233677 0.398119
77 123 0 1 0 1.0 0 0.4938776 0.0300184 0.487430
78 7 0 1 0 1.0 0 0.4915572 0.0267532 0.364760
79 39 0 1 1 0.0 0 0.4623894 0.0119154 0.308333
80 113 0 1 1 0.0 0 0.4487427 0.0142071 0.179878
81 187 0 1 0 0.0 1 0.4427711 0.0345231 0.332669
82 143 0 1 0 0.0 1 0.4349594 0.0265806 0.317857
83 5 0 1 1 0.0 0 0.4127659 0.0129603 0.444444
84 131 0 1 0 0.0 1 0.4126394 0.0189848 0.689024
85 185 0 1 0 1.0 0 0.3887588 0.0108703 0.195266
86 17 0 1 0 0.0 1 0.3503982 0.0304289 0.251599
87 137 0 1 0 0.0 1 0.3167155 0.0126662 0.207143
88 15 0 1 0 0.0 1 0.3009986 0.0202814 0.392111
89 141 0 1 0 0.0 1 0.3005714 0.0314610 0.238636
prbconv prbpris avgsen polpc taxpc pctmin80 wcon
1 0.3478000 0.486183 7.13 0.00223135 75.67243 0.2854600 436.7666
2 0.2482760 0.488426 10.57 0.00217747 45.89987 0.2565460 206.5527
3 0.3934130 0.415158 9.57 0.00255849 67.67963 0.2304410 315.5760
4 0.4592160 0.363636 11.51 0.00237609 50.19918 0.3822300 349.3267
5 0.2037250 0.431020 14.42 0.00243562 44.21059 0.2639410 404.8150
6 0.2295900 0.379175 11.29 0.00207028 31.53658 0.1331500 291.4508
7 0.3819080 0.367347 12.15 0.00212751 48.76492 0.2344360 360.9549
8 0.2598330 0.407628 11.93 0.00190802 35.69936 0.3777920 283.6695
9 0.3253010 0.370370 10.06 0.00189444 56.86211 0.1008380 346.5888
10 0.3347010 0.429072 10.62 0.00182958 52.62629 0.0962444 313.4738
11 0.4067800 0.492647 8.01 0.00197140 33.03621 0.1509980 315.7290
12 0.5275960 0.436170 6.71 0.00182786 30.99368 0.2021870 281.4259
13 0.7369090 0.320624 10.47 0.00136191 28.59199 0.1099570 324.6088
14 0.4777330 0.385593 11.92 0.00233871 35.99248 0.1952630 316.0858
15 0.4929400 0.476563 8.64 0.00158619 27.76489 0.1699130 334.1035
16 0.4505670 0.474820 8.96 0.00121531 30.84900 0.2174990 318.3808
17 0.3869260 0.415525 5.51 0.00126447 32.48096 0.0431205 293.8034
18 0.4500000 0.421053 9.59 0.00122733 32.74533 0.3442830 318.0599
19 1.6705199 0.470588 13.02 0.00445923 53.66693 0.3743110 315.1641
20 0.2719470 0.334951 8.99 0.00154457 27.46926 0.2638140 264.0406
21 0.3842360 0.381410 8.81 0.00195614 67.84798 0.2354980 314.0595
22 0.7816090 0.411765 9.12 0.00108043 27.32160 0.0968564 270.0951
23 0.3430230 0.548023 7.06 0.00172948 27.59179 0.4462830 244.8362
24 0.5950780 0.409774 6.62 0.00152665 29.97040 0.1840120 333.0927
25 0.4267780 0.443137 7.73 0.00218874 36.18621 0.3564110 295.1822
26 0.4037800 0.365957 7.07 0.00146111 29.08280 0.0793198 284.9890
27 0.4436810 0.377709 7.48 0.00191777 39.16965 0.2160360 317.5345
28 0.3278690 0.150000 6.64 0.00316379 37.72702 0.2539140 231.6960
29 0.3643530 0.450216 8.47 0.00168747 53.17796 0.3904780 279.4214
30 0.3529410 0.321429 11.69 0.00145013 43.06339 0.0645795 292.7350
31 1.3581400 0.335616 15.99 0.00158289 32.02376 0.3427990 290.9091
32 0.2894740 0.345455 14.22 0.00167448 29.49915 0.3806130 278.0824
33 0.3286640 0.426230 9.90 0.00147820 38.44067 0.1788580 314.4250
34 0.3225810 0.371429 10.48 0.00155144 35.09686 0.1257910 293.5538
35 0.3601530 0.340426 12.57 0.00132430 25.69287 0.0697606 340.4792
36 0.1544520 0.403922 9.84 0.00185739 30.62824 0.5169320 362.1527
37 0.9090910 0.458333 8.67 0.00157442 31.15306 0.0564349 385.3424
38 0.3718790 0.356890 8.70 0.00148532 39.23048 0.2990720 292.8322
39 0.5598230 0.386544 8.71 0.00145925 31.37446 0.6144970 253.2447
40 0.3005780 0.288462 12.27 0.00227837 31.33022 0.0198320 238.3758
41 0.5490200 0.428571 8.78 0.00143730 27.16143 0.2562870 245.8896
42 0.4685310 0.432836 7.42 0.00151382 38.74739 0.4242560 309.1764
43 1.4814800 0.450000 6.35 0.00074588 26.89208 0.0791632 255.1020
44 0.6225170 0.425532 6.50 0.00119655 41.07194 0.1087040 280.8989
45 0.5081970 0.451613 7.98 0.00124824 31.34530 0.1303660 275.4970
46 0.5714290 0.473881 9.65 0.00140045 34.87021 0.2094790 245.6298
47 0.5207100 0.458333 11.33 0.00138447 37.22833 0.2852840 262.6642
48 1.1829300 0.360825 12.23 0.00118573 25.95258 0.0546081 314.1660
49 0.5484950 0.341463 11.11 0.00169180 32.82694 0.0269921 250.7271
50 0.5223880 0.542857 11.06 0.00118719 29.58239 0.0552725 245.1514
51 0.4389610 0.396450 7.36 0.00200971 48.15414 0.2258190 261.5648
52 0.5888590 0.423423 5.86 0.00117887 28.51783 0.0593109 285.8289
53 0.7229730 0.448598 12.36 0.00109520 37.70785 0.0767092 345.6391
54 0.4572100 0.410256 7.85 0.00138532 41.08650 0.5056250 269.1710
55 0.3404910 0.594595 8.47 0.00137040 44.64758 0.3295160 305.3435
56 0.4528300 0.520833 17.41 0.00149399 41.76929 0.4264210 256.4102
57 0.5283020 0.488095 9.60 0.00246711 29.70588 0.4545130 254.7925
58 0.7393940 0.418033 9.10 0.00107108 35.37642 0.4232240 372.1622
59 0.4105960 0.387097 7.31 0.00164549 46.41461 0.0603408 253.3395
60 0.5899050 0.454545 8.79 0.00162189 31.85298 0.5680850 289.2227
61 0.3084110 0.454545 6.19 0.00102496 37.50189 0.4546750 223.9199
62 0.0683761 0.500000 13.00 0.00288203 35.22974 0.0154070 250.4006
63 0.5739440 0.484663 5.45 0.00142641 27.47967 0.5808270 231.1391
64 1.0153800 0.227273 14.62 0.00151871 29.03402 0.1000460 223.6136
65 0.7932330 0.236967 11.83 0.00119765 38.81493 0.4542550 242.3077
66 1.2256100 0.333333 10.34 0.00202425 61.15251 0.2431170 260.1381
67 0.7633330 0.270742 7.79 0.00128127 32.66050 0.3340320 367.8286
68 1.2343800 0.556962 14.75 0.00185912 39.57348 0.1428460 259.7841
69 0.4765630 0.442623 8.22 0.00086018 28.05474 0.0179619 292.3077
70 0.4011980 0.455224 11.25 0.00136587 34.01291 0.3634950 217.1781
71 0.1403510 0.250000 11.96 0.00112225 50.38139 0.1790960 266.4504
72 0.5154640 0.480000 7.32 0.00075593 27.38110 0.4391690 218.8868
73 0.5206070 0.420833 10.55 0.00133771 30.69649 0.3217940 238.3064
74 0.3854960 0.316832 8.69 0.00130501 34.70248 0.2780790 307.2780
75 0.2078310 0.304348 13.57 0.00400962 34.48254 0.0649622 309.5238
76 0.4934380 0.361702 8.77 0.00141622 32.59961 0.3499950 244.2002
77 0.2263610 0.443038 6.49 0.00176086 38.45734 0.2596020 345.1677
78 0.5254240 0.435484 7.14 0.00152994 42.94759 0.4791610 375.2345
79 0.9729730 0.291667 11.58 0.00119154 27.27564 0.0394549 277.5575
80 0.2203390 0.461538 6.39 0.00151600 40.80142 0.0239865 244.7552
81 0.4431140 0.432432 6.98 0.00116911 34.71814 0.4421320 264.4231
82 0.3146070 0.250000 9.36 0.00085438 31.22779 0.3863590 250.8361
83 0.2678570 0.600000 6.76 0.00123431 34.81605 0.0316053 226.9470
84 0.4955750 0.401786 9.97 0.00121549 37.70006 0.6194210 225.8647
85 2.1212101 0.442857 5.38 0.00122210 40.82454 0.6434820 226.8245
86 0.4364410 0.436893 7.32 0.00129761 34.96204 0.4038900 193.6432
87 1.0689700 0.322581 6.18 0.00081426 44.29367 0.3304480 299.4956
88 0.7692310 0.507692 10.64 0.00103525 34.00304 0.6105400 253.5926
89 0.4126980 0.487179 13.18 0.00127115 35.97390 0.3985150 257.9737
wtuc wtrd wfir wser wmfg wfed wsta wloc
1 548.3239 354.6761 509.4655 354.3007 494.30 568.40 329.22 379.77
2 379.5547 189.1807 278.0352 230.4981 275.72 419.07 400.59 313.55
3 392.0999 220.4530 363.2880 292.7027 464.49 548.49 421.36 319.08
4 548.9865 238.9154 435.1107 391.3081 646.85 563.77 415.51 362.58
5 489.3144 308.5762 420.8864 305.1543 448.86 563.72 426.47 333.64
6 595.3719 240.3673 348.0254 295.2301 358.95 509.43 359.11 339.58
7 528.5593 306.0835 430.0697 348.2754 444.45 597.95 453.08 362.99
8 412.4720 213.7524 324.8357 257.3344 441.72 433.94 367.34 333.71
9 469.2220 277.2925 359.3117 296.8491 341.32 525.51 360.68 333.32
10 433.8580 228.1740 363.7671 318.3635 378.90 496.13 381.30 335.36
11 384.6154 220.5897 358.6328 318.0335 355.59 486.36 411.08 357.44
12 408.7245 221.2701 453.1722 274.1775 334.54 477.58 292.09 311.91
13 418.6380 225.5296 338.5699 261.3512 324.67 496.07 325.15 314.01
14 420.8830 179.1289 389.8522 292.2253 388.75 509.95 499.59 333.05
15 475.3228 260.2710 329.5464 265.4315 374.41 491.16 346.81 351.74
16 403.0558 248.7759 301.8632 293.1148 367.42 463.37 352.35 320.82
17 501.9174 214.2170 334.1840 258.3238 443.70 496.28 308.01 305.71
18 400.8570 230.9888 320.0345 238.4958 295.26 334.55 375.45 327.62
19 377.9356 246.0614 411.4330 296.8684 392.27 480.79 303.11 337.28
20 318.9644 183.2609 265.1232 230.6581 258.25 326.10 329.43 301.64
21 401.1326 237.8523 323.9486 274.8686 352.08 463.11 267.78 343.13
22 374.0296 226.4881 335.4204 230.3086 320.20 453.61 389.34 302.93
23 365.4716 279.2273 325.0271 213.5822 290.69 453.53 317.23 286.45
24 421.6986 222.9622 338.7534 282.9701 346.51 480.14 351.17 330.93
25 379.8962 205.4827 377.6978 274.6765 390.01 543.21 464.63 330.78
26 400.7398 213.1840 325.0271 315.0242 327.45 463.67 361.21 308.32
27 409.7771 192.4181 302.6162 245.7938 418.48 478.48 342.13 318.07
28 213.6752 175.1604 267.0940 204.3792 193.01 334.44 414.68 304.32
29 350.1326 202.9344 333.9183 276.2629 350.63 496.75 332.12 324.97
30 428.5023 213.4620 316.9436 292.3517 309.27 450.28 283.60 310.85
31 426.3901 257.6008 441.1413 305.7612 329.87 508.61 380.30 329.71
32 441.5954 201.3569 302.6724 260.5459 264.25 452.59 348.05 285.47
33 361.1018 241.0034 342.6819 270.4866 349.63 459.32 387.16 376.45
34 419.0362 221.0995 326.8283 253.0096 351.09 463.69 312.53 311.47
35 415.1317 218.7198 322.4150 278.1124 294.37 474.26 298.66 294.72
36 540.1061 209.0579 316.2955 216.4589 313.71 543.03 348.88 329.16
37 412.5212 224.7224 339.9547 253.6207 288.32 452.53 341.77 324.06
38 406.5041 212.4851 312.6954 304.3066 375.32 474.30 297.69 290.48
39 407.0929 200.2213 337.9095 262.9849 278.82 441.49 381.46 304.73
40 354.2510 180.9359 369.4332 253.2281 304.72 427.84 451.79 297.19
41 448.7180 198.6254 248.1390 264.6729 310.86 437.03 397.54 327.37
42 471.3424 203.0162 294.4124 267.6852 374.25 442.38 381.14 285.50
43 376.2542 196.0101 258.5650 192.3077 300.38 409.83 362.96 301.47
44 335.4590 210.3365 317.3077 316.6645 316.57 428.33 388.92 307.25
45 402.5045 201.5675 279.2492 251.4202 334.92 450.28 277.60 302.83
46 378.1590 203.1547 307.2585 232.1825 325.03 459.90 350.27 317.12
47 294.6650 192.8994 267.3797 237.1590 301.29 467.08 350.24 302.25
48 341.8803 182.8020 348.1432 212.8205 322.92 391.72 385.65 306.85
49 437.9284 196.5065 288.4615 243.4706 588.99 488.76 346.32 294.21
50 571.5438 179.5039 327.9666 251.4270 260.35 384.15 354.41 304.91
51 457.5350 199.5847 299.7388 320.1325 277.68 447.87 361.24 300.77
52 480.1948 268.3836 365.0196 295.9352 295.63 468.26 337.88 348.74
53 588.6970 190.5555 331.5650 206.6858 306.42 406.62 348.37 306.68
54 480.7692 200.3768 297.2437 239.2233 360.27 466.61 367.24 302.63
55 538.8488 200.7432 302.1978 219.6343 334.65 414.63 298.78 295.00
56 379.0005 238.5589 271.7391 232.5916 332.07 451.84 389.99 312.05
57 391.7379 197.6995 311.3553 199.4458 360.21 512.30 369.75 329.34
58 508.2035 266.7794 466.0016 347.6609 560.78 516.05 381.03 388.09
59 402.5045 172.6453 317.3394 257.8734 627.02 344.06 357.69 298.33
60 431.9210 198.0064 253.3395 266.4674 316.19 430.64 314.75 297.34
61 320.5128 171.0329 334.4482 250.0000 218.30 353.31 385.19 323.08
62 401.3378 187.8255 258.5650 237.1507 258.60 391.48 325.71 275.22
63 187.6173 161.3771 254.5249 182.0196 298.78 471.09 400.11 292.98
64 437.0629 188.7683 353.2182 210.4415 289.43 421.34 342.92 301.23
65 424.7286 167.0511 282.4099 229.0151 327.29 383.88 360.66 302.03
66 613.2261 191.2452 290.5141 266.0934 567.06 403.15 258.33 299.44
67 342.5724 194.6777 352.6135 256.4102 355.83 426.56 313.71 313.84
68 417.2099 168.2692 301.5734 247.6291 258.99 442.76 387.02 291.44
69 377.3126 206.8215 289.3125 215.1933 290.89 377.35 367.23 342.82
70 415.2824 196.5514 279.0347 238.6300 295.07 429.27 318.02 299.92
71 202.4292 219.7802 305.9441 223.8502 250.42 371.79 383.72 296.64
72 286.4157 195.1995 368.2488 172.4733 324.45 357.16 407.54 268.44
73 366.3004 205.5358 310.1737 259.3391 303.42 449.84 350.72 283.76
74 462.4408 227.7858 305.6546 231.3615 321.90 460.62 393.29 321.53
75 445.2762 189.7436 284.5933 221.3903 319.21 338.91 361.68 326.08
76 308.5150 214.2203 327.1213 203.8864 266.58 414.84 333.46 308.05
77 396.2704 193.0723 272.2941 242.4605 277.34 345.09 328.00 325.77
78 397.6901 191.1720 281.0651 256.7214 281.80 412.15 328.27 299.03
79 390.1895 180.0634 295.8580 246.0152 270.78 397.33 313.06 239.17
80 412.0879 154.2090 256.4102 265.1301 291.10 337.09 374.11 246.65
81 421.3483 170.5293 282.0513 183.1502 297.14 390.94 356.91 267.08
82 373.9316 178.3723 234.5216 133.0431 157.41 380.97 388.43 253.66
83 372.2084 229.3209 305.9441 209.6972 237.65 358.98 331.53 281.37
84 375.2345 220.9747 307.6923 172.6281 278.70 432.81 370.81 259.78
85 331.5650 167.3726 264.4231 266.7583 247.72 381.33 367.25 300.13
86 346.6011 202.9595 268.3363 208.2520 339.76 389.51 322.06 278.39
87 356.1254 170.8711 170.9402 250.8361 192.96 360.84 283.90 321.73
88 353.2182 199.2377 356.1254 206.2816 235.05 416.49 370.62 297.13
89 388.3136 179.7050 263.7363 196.1453 255.56 374.78 380.29 294.98
mix pctymle
1 0.16869897 0.07916495
2 0.21999694 0.07647973
3 0.07871422 0.08109921
4 0.07585382 0.09468981
5 0.10255422 0.08310476
6 0.10186080 0.07939028
7 0.08527010 0.09935585
8 0.10474319 0.14223780
9 0.18122160 0.07891140
10 0.06289308 0.07219726
11 0.08848315 0.07641540
12 0.08016878 0.07787097
13 0.07497714 0.07873024
14 0.05091770 0.13302912
15 0.09146758 0.07705218
16 0.08594864 0.08013537
17 0.09090909 0.06795343
18 0.08616445 0.08828809
19 0.15612382 0.07945071
20 0.12176319 0.24871162
21 0.10368066 0.07365920
22 0.05424063 0.08050946
23 0.10003893 0.07977433
24 0.08431085 0.07655433
25 0.09345226 0.11421655
26 0.08400646 0.08397806
27 0.12467756 0.07771273
28 0.41975310 0.07462687
29 0.11912815 0.07455523
30 0.08957055 0.08353864
31 0.06305506 0.07400288
32 0.34879407 0.09625563
33 0.13481072 0.08703093
34 0.10925926 0.07687439
35 0.07342084 0.07763254
36 0.09364294 0.07622346
37 0.08207343 0.07735254
38 0.10692308 0.12224479
39 0.12516960 0.08183564
40 0.05719921 0.15092644
41 0.14886613 0.10694169
42 0.09595300 0.08548180
43 0.03022670 0.08260694
44 0.11616162 0.07756928
45 0.11024390 0.07362188
46 0.09447166 0.07609449
47 0.16323297 0.07570874
48 0.06756757 0.07419893
49 0.09794629 0.07600891
50 0.06763285 0.06861899
51 0.07110778 0.07415335
52 0.11050157 0.07819394
53 0.13720317 0.08280677
54 0.09106830 0.07521905
55 0.22533333 0.07275662
56 0.09872611 0.06355526
57 0.17905165 0.08345764
58 0.07977737 0.08181948
59 0.10230179 0.08436658
60 0.09869203 0.08729226
61 0.28078818 0.08120255
62 0.31952664 0.09891920
63 0.18528996 0.08633697
64 0.11682243 0.06215772
65 0.07485030 0.07632116
66 0.05334728 0.07713232
67 0.12405757 0.07381025
68 0.01960784 0.12894706
69 0.06008584 0.07069755
70 0.12108560 0.07472797
71 0.08045977 0.08476309
72 0.15112540 0.08005202
73 0.15237226 0.07073344
74 0.08100930 0.07154077
75 0.08437271 0.07613807
76 0.16707317 0.07579902
77 0.31135532 0.08119376
78 0.27362204 0.07353726
79 0.13744076 0.06973287
80 0.05128205 0.09171820
81 0.29048842 0.07794872
82 0.21212120 0.06769374
83 0.46511629 0.07211538
84 0.16725978 0.08356434
85 0.04968944 0.07008217
86 0.21818182 0.07769163
87 0.06870229 0.07098370
88 0.23495702 0.07430546
89 0.20000000 0.07572646
ggcorr(myData, nbreaks=8, palette='RdBu', label=TRUE, label_size=5, label_color='white')